Synthesis and Verification of Digital Circuits using Functional Simulation and Boolean Satisfiability. by Plaza, Stephen M.





A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
(Computer Science and Engineering)
in The University of Michigan
2008
Doctoral Committee:
Associate Professor Igor L. Markov, Co-Chair
Assistant Professor Valeria M. Bertacco, Co-Chair
Professor John P. Hayes
Professor Karem A. Sakallah
Associate Professor Dennis M. Sylvester
© Stephen M. Plaza 2008
All Rights Reserved
To my family, friends, and country
ACKNOWLEDGEMENTS
I would like to thank my advisers, Professor Igor Markov and Professor Valeria Bertacco,
for inspiring me to consider various fields of research and providing feedback on my
projects and papers. I also want to thank my defense committee for their comments and
insights: Professor John Hayes, Professor Karem Sakallah, and Professor Dennis Sylvester.
I would like to thank Professor David Kieras for enhancing my knowledge and appreciation
of computer programming and providing invaluable advice.
Over the years, I have been fortunate to know and work with several wonderful students.
I have collaborated extensively with Kai-hui Chang and Smita Krishnaswamy and
have enjoyed numerous research discussions with them and have benefited from their
insights. I would like to thank Ian Kountanis and Zaher Andraus for our many fun
discussions on parallel SAT. I also appreciate the time spent collaborating with Kypros
Constantinides and Jason Blome. Although I have not formally collaborated with Ilya Wagner, I
have enjoyed numerous discussions with him during my doctoral studies. I also thank my
office mates Jarrod Roy, Jin Hu, and Hector Garcia.
Without my family and friends I would never have come this far. I would like to thank
Geoff Blake and Smita Krishnaswamy, who have been both good friends and colleagues
and who have talked to me often when the stress at work was overwhelming. I also want
to thank Geoff for his patience being my roommate for so many years. I am blessed to
also have several good friends outside of the department who have provided me a lot of
support: Steve Kibit (Steve2 representin’), Rob Denis, and Jen Pileri.
Most of all, I would like to thank my family who has been an emotional crutch for me.
My mom, dad, and brother Mark have all been supportive of my decision to go for a PhD
and have continuously encouraged me to strive for excellence.
PREFACE
The semiconductor industry has long relied on the steady trend of transistor scaling,
that is, the shrinking of the dimensions of silicon transistor devices, as a way to improve
the cost and performance of electronic devices. However, several design challenges have
emerged as transistors have become smaller. For instance, wires are not scaling as fast as
transistors, and delay associated with wires is becoming more significant. Moreover, in
the design flow for integrated circuits, accurate modeling of wire-related delay is available
only toward the end of the design process, when the physical placement of logic units
is known. Consequently, one can only know whether timing performance objectives are
satisfied, i.e., if timing closure is achieved, after several design optimizations. Unless
timing closure is achieved, time-consuming design-flow iterations are required. Given the
challenges arising from increasingly complex designs, failing to quickly achieve timing
closure threatens the ability of designers to produce high-performance chips that can match
continually growing consumer demands.
In this dissertation, we introduce powerful constraint-guided synthesis optimizations
that take into account upcoming timing closure challenges and eliminate expensive design
iterations. In particular, we use logic simulation to approximate the behavior of increasingly
complex designs, leveraging a recently proposed concept, called bit signatures,
which allows us to represent a large fraction of a complex circuit’s behavior in a compact
data structure. By manipulating these signatures, we can efficiently discover a greater
set of valid logic transformations than was previously possible and, as a result, enhance
timing optimization. Based on the abstractions enabled through signatures, we propose
a comprehensive suite of novel techniques: (1) a fast computation of circuit don’t-cares
that increases restructuring opportunities, (2) a verification methodology to prove the
correctness of speculative optimizations that efficiently utilizes the computational power of
modern multi-core systems, and (3) a physical synthesis strategy using signatures that
re-implements sections of a critical path while minimizing perturbations to the existing
placement. Our results indicate that logic simulation is effective in approximating the be-




DEDICATION
ACKNOWLEDGEMENTS
PREFACE
LIST OF FIGURES
LIST OF TABLES
LIST OF ABBREVIATIONS
PART
I Introduction and Background
Chapter I. Introduction: Overcoming Challenges in Nanometer Design
1.1 Trends in the Electronics and EDA Industries
1.2 Challenges in High-Performance Integrated Circuit Design
1.3 Bridging the Gap between Logic and Physical Optimizations
1.4 Using Simulation-based Abstractions for Circuit Optimizations
1.5 Components of Our Simulation-based Framework
1.6 Organization of the Dissertation
Chapter II. Synergies between Synthesis, Verification, and Functional Simulation
2.1 Scalable Verification
2.1.1 Satisfiability
2.1.2 Previous Parallel SAT Approaches
2.2 Scalable Logic Synthesis
2.2.1 Don’t Care Analysis
2.2.2 Logic Rewriting
2.2.3 Physically-aware Synthesis
2.3 Logic Simulation and Bit Signatures
2.4 Summary
Chapter III. Challenges to Achieving Design Closure
3.1 Physical Synthesis
3.2 Advances in Integrated Circuit Design
3.3 Limitations of Current Industry Solutions
3.4 Contributions of the Dissertation
II Improving the Quality of Functional Simulation
Chapter IV. High-coverage Functional Simulation
4.1 Improving Verification Coverage through Automated Constrained-Random Simulation
4.2 Finding Inactive Parts of a Circuit
4.2.1 Toggle Activity of a Signal
4.2.2 Toggle Activity of Multiple Bits
4.3 Targeted Re-simulation
4.3.1 Random Simulation with SAT
4.3.2 Partition-Targeted Simulation
4.4 Empirical Validation
4.5 Concluding Remarks
Chapter V. Enhancing Simulation-based Abstractions with Don’t Cares
5.1 Encoding Don’t Cares in Signatures
5.2 Global ODC Analysis
5.2.1 Approximate ODC Simulator
5.2.2 False Positives and False Negatives
5.2.3 Analysis and Approximation of ODCs
5.2.4 Performance of Approximate Simulator
5.3 Concluding Remarks
III Improving the Efficiency of Formal Equivalence Checking
Chapter VI. Incremental Verification with Don’t Cares
6.1 Verifying Signature Abstractions
6.2 Incremental Equivalence Checking up to Don’t Cares
6.2.1 Moving-dominator Equivalence Checker
6.2.2 Verification Algorithm
6.2.3 Calculating Dominators
6.3 Concluding Remarks
Chapter VII. Multi-threaded SAT Solving
7.1 Parallel-processing Methodologies in EDA
7.2 Runtime Variability in SAT Solving
7.3 Scheduling SAT Instances of Varying Difficulty
7.4 Current Parallel SAT Solvers
7.5 Solving Individual Hard Instances in Parallel
7.5.1 Search Space Partitioning
7.5.2 Lightweight Parallel SAT
7.6 Empirical Validation
7.6.1 Effective Scheduling of SAT Instances
7.6.2 Solving Individual Hard Problems
7.6.3 Partitioning Strategies
7.6.4 Parallel Learning Strategies
7.7 Concluding Remarks
IV Improving Logic and Physical Synthesis
Chapter VIII. Signature-based Manipulations
8.1 Logic Transformations through Signature Manipulations
8.2 ODC-enhanced Node Merging
8.2.1 Identifying ODC-based Node Mergers
8.2.2 Empirical Validation
8.3 Determining Logic Feasibility with Signatures
8.4 Concluding Remarks
Chapter IX. Path-based Physical Resynthesis using Functional Simulation
9.1 Logic Restructuring for Timing Applications
9.2 Identifying Non-monotone Paths
9.2.1 Path Monotonicity
9.2.2 Calculating Non-monotone Factors
9.3 Physically-aware Logic Restructuring
9.3.1 Subcircuit Extraction
9.3.2 Physically-guided Topology Construction
9.4 Enhancing Resynthesis through Global Signature Matching
9.5 Empirical Validation
9.5.1 Prevalence of Non-monotonic Interconnect
9.5.2 Physically-aware Restructuring
9.5.3 Comparison with Redundancy Addition and Removal
9.6 Concluding Remarks
Chapter X. Conclusions
10.1 Summary of Contributions
10.2 Directions for Future Research
INDEX




LIST OF FIGURES
1.1 Transistors manufactured on a single chip over several generations of Intel CPUs.
1.2 Transistor scaling projected at future technology nodes.
1.3 Major components of multilayer interconnect: single-layer wire segments and
inter-layer connectors (vias).
1.4 Typical integrated circuit design flow. The design flow starts from an initial
design specification. Several optimization steps are performed, and then a final
chip is manufactured.
2.1 Pseudo-code of the search procedure used in DPLL-SAT. The procedure terminates
when it either finds a satisfying assignment or proves that no such solution exists.
2.2 An example conflict graph that is the result of the last two clauses in the list
conflicting with the current assignment. We show two potential learnt clauses that
can be derived from the illustrated cuts. The dotted line closest to the conflict
represents the 1-UIP cut, and the other is the 2-UIP cut.
2.3 Satisfiability don’t-cares (SDCs) and observability don’t-cares (ODCs). a) An
example of an SDC. b) An example of an ODC.
2.4 ODCs are identified for an internal node a in a netlist by creating a modified
copy of the netlist where a is inverted and then constructing a miter for each
corresponding output. The set of inputs for which the miter evaluates to 1
corresponds to the care-set of that node.
2.5 Two examples of AIG rewriting. In the first example, rewriting results in a
subgraph with fewer nodes than the original. Through structural hashing, external
nodes are reused to reduce the size of the subgraph, as shown in the second example.
3.1 Delay trends from ITRS 2005. As we approach the 32nm technology node, global and
local interconnect delay become more significant compared to gate delay.
3.2 Topology construction and buffer assignment [43]. Part a) shows the initial
topology and part b) shows an embedding and buffer assignment for that topology
that accounts for the time criticality of b. In part c), a better topology is
considered whose embedding and buffer assignment improves the delay for b.
3.3 Logic restructuring. The routing of signal a with late arrival time shown in
part a) can be optimized by connecting a to a substitute signal with earlier arrival
time as in part b). In this example, the output of the gate AND(b,c) is a
resynthesis of a.
3.4 Evolution of the digital design flow to address design closure challenges due to
the increasing dominance of wire delay. a) Design flow with several discrete steps.
b) Improved design flow using physical synthesis and refined timing estimates to
achieve timing closure more reliably. c) Modern design flow where logic and physical
optimization stages are integrated to leverage better timing estimates earlier in
the flow.
4.1 Our Toggle framework automatically identifies the components of a circuit that
are poorly stimulated by random simulation and generates input vectors targeting them.
4.2 The entropy of each bit for an 8-bit bidirectional counter after 100, 1000, and
10000 random simulation vectors are applied is shown in part a). Part b) shows the
entropy achieved after 100, 200, and 300 guided vectors are applied after initially
applying 100 random vectors.
4.3 a) XOR constraints are added to reduce the solution space of a SAT instance C∗,
which is sparser than the solution space of C. b) Component A is targeted for
simulation, so that its m inputs are evenly sensitized within circuit C.
4.4 Partition simulation algorithm.
5.1 Example of our ODC representation for a small circuit. For clarity, we only show
ODC information for node c (not shown is the downstream logic determining those
don’t-cares). For the other internal nodes, we report only their signature S. When
examining the first four simulation patterns, node b is a candidate for merging with
node c up to ODCs. Further simulation indicates that an ODC-enabled merger is not
possible.
5.2 Efficiently generating ODC masks for each node.
5.3 Example of a false negative generated by our approximate ODC simulator due to
reconvergence. S∗ and S are shown for all internal nodes; only S is shown for the
primary inputs and outputs.
6.1 An example that shows how to prove that node g can implement node f in the
circuit. a) A miter is constructed between f and g to check for equivalence, but it
does not account for ODCs because the logic in the fanout cone of f is not
considered. b) A dominator set can be formed in the fanout cone of f and miters can
be placed across the dominators to account for ODCs.
6.2 Determining whether two nodes are equivalent up to ODCs.
7.1 High-level flow of our concurrent SAT methodology. We introduce a scheduler for
completing a batch of SAT instances of varying complexity and a lightweight parallel
strategy for handling the most complex instances.
7.2 Number of SAT instances solved vs. time for the SAT 2003 collection. The timeout
is 64 minutes.
7.3 Percentage of total restarts for each minute of execution for a random sample of
instances from the SAT 2003 collection.
7.4 Parallel SAT Algorithm.
7.5 The number of SAT instances solved (within the time allowed) by considering three
different scheduling schemes for an 8-threaded machine. Our priority scheme gives the
best average latency, which is 20% better than batch mode and 29% better than
time-slice mode.
7.6 a) The percentage of satisfiable instances where the first thread that completes
finds a satisfying assignment. b) The standard deviation of runtime between threads.
Using XOR constraints as opposed to splitting one variable can significantly improve
load balance and more evenly distribute solutions among threads.
7.7 The effectiveness of sharing learnt clauses by choosing the most active learnt
clauses compared to the smallest learnt clauses.
9.1 The resynthesis of a non-monotone path can produce much shorter critical paths and
improve routability.
9.2 Improving delay through logic restructuring. In our solution, we first identify
the most promising regions for improvements, and then we restructure them to improve
delay. Such netlist transformations include gate cloning, but are also substantially
more general. They do not require the transformed subcircuits to be equivalent to
the original ones. Instead, they use simulation and satisfiability to ensure that the
entire circuit remains equivalent to the original.
9.3 Computing the non-monotone factor for k-hop paths.
9.4 Calculating the non-monotone factor for path {d,h}. The matrix shows
sub-computations that are performed while executing the algorithm in Figure 9.3.
9.5 Our flow for restructuring non-monotone interconnect. We extract a subcircuit
selected by our non-monotone metric and search for alternative equivalent topologies
using simulation. The new implementations are then considered based on the
improvement they bring and verified to be equivalent with an incremental SAT solver.
9.6 Extracting a subcircuit for resynthesis from a non-monotone path.
9.7 Signatures and topology constraints guide logic restructuring to improve critical
path delay. The figure shows the signatures for the inputs and output of the topology
to be derived. Each table represents the PBDs of the output F that are distinguished.
The topology that connects a and b directly with a gate is infeasible because it does
not preserve essential PBDs of a and b. A feasible topology uses b and c, followed by a.
9.8 Restructuring non-monotone interconnect.
9.9 The graph plots the percentage of paths whose NMF is below the corresponding value
indicated on the x-axis. Notice that longer paths tend to be non-monotone and at
least 1% of paths are > 5 times the ideal minimal length.
9.10 The graph above illustrates that the largest actual delay improvements occur at
portions of the critical path with the largest estimated gain using our metric. The
data points are accumulated gains achieved by 400




LIST OF TABLES
4.1 Generating even stimuli through random XOR constraints for the 14 inputs of alu4.
We normalize the entropy seen along the inputs by log2(#simvectors), so that 1.0 is
the highest entropy possible.
4.2 Entropy analysis on partitioned circuits, the number of new input combinations
found and the percentage of entropy increase after adding 32 guided input vectors
versus 32 random ones.
4.3 Comparing SAT-based re-simulation with random re-simulation over a partition for
generating 32 vectors. The time-out is 10000 seconds.
5.1 Comparisons between related techniques to expose circuit don’t-cares. Our solution
can efficiently derive both global SDCs and ODCs.
5.2 Efficiency of the approximate ODC simulator.
5.3 Runtime comparison between techniques from [95] and our global simulation.
7.1 MiniSAT 2 results on the SAT 2003 benchmark suite.
7.2 Running MiniSAT on a set of benchmarks of similar complexity using a varying
number of threads.
7.3 Hard SAT instances solved using 8 threads of computation with a portfolio of solvers.
7.4 Hard SAT instances solved using 4 threads of computation with a portfolio of solvers.
8.1 Evaluation of our approximate ODC simulator in finding node merger candidates: we
show the total number of candidates after generating 2048 random input patterns and
report the percentage of false positives and negatives.
8.2 Area reductions achieved by applying the ODC merging algorithm after ABC’s
synthesis optimization [62]. The time-out for the algorithm was set to 5000 seconds.
8.3 Gate reductions and performance cost of the ODC-enhanced node-merging algorithm
when applied to circuits synthesized with DesignCompiler [104] in high-effort mode.
The merging algorithm runtime is bound to 1/3 of the corresponding runtime in
DesignCompiler.
8.4 Percentage of mergers that can be detected by considering only K levels of logic,
for various K.
8.5 Comparison with circuit unrolling. Percentage of total mergers exposed by the
local ODC algorithm (K=5) for varying unrolling depths.
8.6 Statistics for the ODC merging algorithm on unsynthesized circuits. The table
reports the SAT success rate in validating merger candidates and the number of SAT
calls that could be avoided because of the use of dynamic simulation vectors.
9.1 Significant delay improvement is achieved using our path-based logic restructuring.
Delay improvement is typically accompanied by only a small wirelength increase.
9.2 Effectiveness of our approach compared to RAR.
LIST OF ABBREVIATIONS
2-SAT 2-SATisfiability (Satisfiability instance where each clause has at most two literals)




AMD Advanced Micro Devices
AT Arrival Time
ATPG Automatic Test Pattern Generation
BCP Boolean Constraint Propagation
BDD Binary Decision Diagram
C(b) Care set of b
CAD Computer Aided Design
CNF Conjunctive Normal Form
CODC Compatible Observability Don’t Care
CPU Central Processing Unit
D2M Delay with 2 Moments
DC Design Compiler (Synopsys synthesis tool) or Don’t Care
DC(b) Don’t Care set of b
DPLL Davis-Putnam-Logemann-Loveland (satisfiability algorithm)
EDA Electronic Design Automation
FLUTE A software package from Iowa State University implementing fast RSMT construction. It is based on lookup tables.
FPGA Field-Programmable Gate Array
GB GigaByte
GHz GigaHertz
GSRC Gigascale Systems Research Center
GTECH Generic TECHnology library (Synopsys)
HPWL Half-Perimeter WireLength
IC Integrated Circuit
ITRS International Technology Roadmap for Semiconductors
IWLS International Workshop on Logic and Synthesis
MOSFET Metal-Oxide-Semiconductor Field-Effect Transistor
MUX MUltipleXer
NMF Non-Monotonic Factor
ODC Observability Don’t Care
OFFSET(b) Set of input combinations where b = 0
ONSET(b) Set of input combinations where b = 1
OS Operating System
PBD Pairs of Bits to be Distinguished
RAR Redundancy Addition and Removal
RSMT Rectilinear Steiner Minimal Tree
RTL Register Transfer Level
SAT SATisfiability (problem) or SATisfiable (solution)
SDC Satisfiability Don’t Care
SMP Symmetric MultiProcessing
SPFDs Sets of Pairs of Functions to be Distinguished
SSF Single-Stuck-at Fault
STA Static Timing Analysis
UIP Unique Implication Point
UNSAT UNSATisfiable
U-SAT Unique-SATisfiable problem (only one solution)





Introduction: Overcoming Challenges in Nanometer Design
1.1 Trends in the Electronics and EDA Industries
The performance capabilities of computer chips continue to increase rapidly. This, in
turn, is driving the technology evolution in many different application domains, such as
gaming and scientific computing. A major impetus for this growth is consumer demand,
which seeks the smallest, fastest, and coolest devices. Consumer demand guides performance
objectives and pressures computer companies to meet time-to-market expectations.
Failure to meet these expectations can result in the loss of competitive advantage. For
example, in 2006 Sony postponed the release date of the PlayStation 3 console by six months
due to technical problems, exposing Sony’s gaming market share to competing consoles
by Microsoft and Nintendo.
Figure 1.1: Transistors manufactured on a single chip over several generations of Intel
CPUs.
Many applications depend on the predictability of improvements to integrated circuits.
As shown in Figure 1.1, the number of transistors on a chip has been steadily increasing
during the past 40 years. As a result, the Core 2 Duo CPU has almost one hundred times
more transistors than the Pentium CPU 14 years ago. These scaling (and performance)
trends have been made possible by advances in device manufacturing, which have resulted
in the fabrication of smaller transistors. Transistor sizes are determined by the minimum
size of a geometrical feature (usually a rectangle) that can be manufactured at a given
technology node. Figure 1.2 illustrates the decreasing transistor size, where the physical length
LGATE of the transistor’s gate is currently at 50nm and expected to shrink to 15nm as
manufacturing techniques continue to improve. This scaling trend was observed by Gordon
Moore in 1965, when he projected that the number of transistors that fit in an integrated
circuit would double every two years [64], corresponding to an exponential growth. With
more transistors, entire systems that were once implemented across a computer board now
fit on a single chip. More recently, this trend has made it possible to pack multiple
processors in the same chip, so-called multi-core processors. Multi-core processors are now
mainstream in the mass-market desktop computing domain, overcoming the performance
wall that single-core microprocessors had reached, and allowing multiple applications to
run in parallel on the same desktop system.
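The doubling rule can be checked against the numbers quoted in this section. The following is a minimal sketch, assuming an idealized model in which the transistor count doubles exactly every two years; the function name growth_factor is illustrative and does not come from the dissertation.

```python
def growth_factor(years: float, doubling_period: float = 2.0) -> float:
    """Idealized multiplier on transistor count after `years` years,
    assuming one doubling every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


# 14 years separate the two CPUs compared in the text: 7 doublings.
factor_14_years = growth_factor(14)

# The 40-year span covered by Figure 1.1 corresponds to 20 doublings.
factor_40_years = growth_factor(40)

print(f"14 years: x{factor_14_years:.0f}")
print(f"40 years: x{factor_40_years:.0f}")
```

Seven doublings give a factor of 128, which is the same order of magnitude as the "almost one hundred times" quoted for the Core 2 Duo relative to the Pentium.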
1.2 Challenges in High-Performance Integrated Circuit Design
Ensuring continued performance improvements has become more challenging as
transistors reach the nanometer scale. First, the complexity of integrated circuits has already
exceeded the capability of designers to optimize¹ and verify their functionality. In design
processes, verifying design correctness is a major component that affects time-to-market.
Also, buggy designs released to the consumer market can significantly impact revenue
as evidenced by Intel’s floating-point division bug [100] and, more recently, by a bug in
AMD’s quad-core Phenom processor [99]. Second, the miniaturization of transistors to
the atomic scale poses several challenges in terms of the variability in the manufacturing
process, which leads to unpredictable performance. Third, the scaling of wire interconnect
is not as pronounced as that of transistors. As transistors get faster and smaller, the
width of wires decreases at a slower rate, and the per-unit resistance of wires may
increase. Therefore, the advantages of having shorter wires are mitigated by the increase in
time that it takes to propagate a signal. Consequently, an increasing percentage of chip
area is necessary for wires, and the maximum clock frequency is primarily determined by
wire lengths, rather than transistor switching time. Figure 1.3 illustrates the prevalence of
interconnect on multiple metal layers on a chip. In this figure, two metal layers are shown
with several wires and interlayer connectors called vias. This interconnect overshadows
the polysilicon gates, which are a component of MOSFETs (transistors).
¹ In this dissertation, optimize is often used to mean performing operations that improve
some performance characteristic.
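The interplay between shorter wires and higher per-unit resistance can be illustrated with a first-order model. The sketch below is not taken from the dissertation: it assumes the standard distributed-RC (Elmore) approximation, in which wire delay is roughly 0.5 * r * c * L^2, a per-unit capacitance that stays constant across technology generations, and illustrative wire dimensions and material constants chosen only for the example.

```python
RHO_CU_OHM_UM = 0.017     # approximate bulk copper resistivity, in ohm*um (assumed)
C_PER_UM_FARAD = 0.2e-15  # per-unit wire capacitance in F/um (assumed constant)


def resistance_per_um(width_um: float, thickness_um: float) -> float:
    """Per-unit-length resistance of a rectangular wire cross-section."""
    return RHO_CU_OHM_UM / (width_um * thickness_um)


def wire_delay(width_um: float, thickness_um: float, length_um: float) -> float:
    """First-order distributed-RC (Elmore) delay of a wire, in seconds."""
    r = resistance_per_um(width_um, thickness_um)
    return 0.5 * r * C_PER_UM_FARAD * length_um ** 2


s = 0.7  # hypothetical linear shrink factor between two technology nodes

# A local wire: all three dimensions shrink by s.
old_delay = wire_delay(0.20, 0.40, 1000.0)
scaled_delay = wire_delay(0.20 * s, 0.40 * s, 1000.0 * s)

# A global wire: the cross-section shrinks but the length does not.
global_delay = wire_delay(0.20 * s, 0.40 * s, 1000.0)

print(f"old: {old_delay:.3e} s, scaled local: {scaled_delay:.3e} s, "
      f"scaled global: {global_delay:.3e} s")
```

Under these assumptions the shortened local wire is no faster than the original, because the quadratic gain from the shorter length is cancelled by the higher per-unit resistance, while a wire whose length stays fixed becomes slower by a factor of 1/s^2. Transistor switching time, in contrast, improves with scaling, which is why wires take up a growing share of the total delay.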
Figure 1.2: Transistor scaling projected at future technology nodes.
Figure 1.3: Major components of multilayer interconnect: single-layer wire segments and
inter-layer connectors (vias).
Traditionally, a computer chip design entails a series of step from high-level concep-
tualization to final chip fabrication. It is this design flow (shown in Figure 1.4) that must
be able to address technology scaling.Starting from the top left of Figure 1.4, a design
team specifies the desired functionality of the chip. The design team identifies the chip’s
4
major components and designs each of them at a high level; this functionality may be expressed
in a hardware description language, such as SystemC. Numerous optimizations are
facilitated by the design team through the use of automated software tools. Eventually, the
design description is translated into a register transfer level (RTL) description (top-right
corner of Figure 1.4), which specifies a design in more detail. Through a process called
logic synthesis, an RTL description is translated into a gate-level netlist as shown at the
bottom right of the figure. In order to simplify transistor-level layout, multiple gates are
mapped to pre-designed standard cells. This process is called technology mapping. At this
point, an area estimate can be made based on the number of transistors required for the
design. Also, one can estimate the fastest possible clock frequency for the chip based on
transistor switching, since the number of cells that occur between the design’s inputs and
outputs is known. After a netlist is mapped to a set of cells, placement is performed. During
placement, a physical location is given to each cell such that they do not overlap, and
then wires connecting cells are routed. Finally, both wires and transistors are represented
by polygons, and the resulting design description is sent for fabrication to obtain the final
product, shown at the end of the flow in Figure 1.4.
Functional and physical verification are needed throughout the design flow. After performing
RTL and netlist-level optimization, the design’s outputs are checked against the
expected behavior of the design. Physical verification ensures that design rules (such
as maintaining minimum spacing between wires), as well as electrical and thermal constraints,
are satisfied. Furthermore, at the end of the design flow before fabrication, the
performance characteristics of the design are checked against desired goals and objectives.
This process of meeting performance objectives is known as achieving design closure.
The process of ensuring that the circuit timing (delay) constraints are met is known
as achieving timing closure.
Figure 1.4: Typical integrated circuit design flow. The design flow starts from an initial
design specification. Several optimization steps are performed, and then a
final chip is manufactured.
Physical information about the design, such as cell locations and wire routes, is known
only at the end of the flow. As previously noted, delay associated with wires is becoming
more prominent, hence accurate timing estimates are known only when wire lengths are
determined after the routing phase. However, most functional design optimizations are
traditionally performed early at the RTL, as well as during logic synthesis and technology
mapping. After several of these optimization steps, placement and routing might produce
a modified design that no longer achieves timing closure. Hence, the inability to gather ac-
curate timing and performance estimations early in the design flow leads to less flexibility
in performing design optimizations. For instance, optimizing for a specific performance
metric after placement and routing, such as timing, can negatively affect the quality of
other performance metrics. To avoid this tension between various design goals, late design
flow optimizations are normally limited. Hence more heavyweight optimizations may
necessitate the re-iteration of earlier steps in the design flow. In some cases, the number of
design iterations required to achieve timing closure is prohibitive. Multiple iterations in-
crease the turn-around-time, development costs, and time-to-market, while also resulting
in a design that might fail to meet original expectations.
1.3 Bridging the Gap between Logic and Physical Optimizations
This dissertation develops powerful and efficient logic transformations that are applied
late in the design flow to achieve timing closure. The transformations overcome the lim-
itations of current methodology by 1) extracting and exploiting more circuit flexibility to
improve performance late in the design flow and 2) minimizing the negative impact to
other performance metrics. The goal is to eliminate costly design iterations and enable efficient
use of multi-core processors, to overcome increased design complexity and scaling
challenges.
To enable these transformations, our work leverages the principle of abstraction by
temporarily discarding all aspects of circuit behavior not observed during a fast bit-parallel
simulation. Under this abstraction, we can handle complex designs, pinpoint potential
logic transformations that may lead to improvements in the design, and assess the quality
of a wide range of transformations. In the following section, we discuss our abstraction
technique and its components in more detail.
1.4 Using Simulation-based Abstractions for Circuit Optimizations
Key to our approach is the use of logic simulation to approximate the behavior of each
node in a circuit through information known as a bit signature [50]. The functionality of
a node in a circuit is defined by its truth table that specifies the node’s output for all input
combinations. A signature is a partial truth table selected by a (usually small) subset of the
possible input combinations. Such a partial truth table can be viewed as an abstracted representation
that can be exponentially smaller than a complete truth table, yet can accurately
guide optimizations as we show throughout this dissertation. Because of the efficiency of
logic simulation, approximating circuit behavior scales linearly with the number of nodes
in the circuit, and consequently it can tackle large circuits. While such signatures have
already been used in the literature, these pre-existing techniques suffer from a number of
limitations.
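To make the notion of a signature concrete, the following Python sketch computes signatures by bit-parallel simulation of a toy netlist. It is an illustration only, not the implementation used in this dissertation: the netlist format (a topologically ordered list of gates), the node names, and the function simulate_signatures are assumptions introduced here. Each signature is stored as one integer, so a single bitwise operation evaluates a gate on all simulation patterns at once.

```python
import random


def simulate_signatures(netlist, inputs, num_patterns=64, seed=0):
    """Return a dict mapping each node name to an integer whose bits form its signature."""
    rng = random.Random(seed)
    mask = 2 ** num_patterns - 1
    # Each primary input receives num_patterns random bits, one per simulation pattern.
    sig = {name: rng.getrandbits(num_patterns) for name in inputs}
    # netlist is a list of (output, gate type, operands) in topological order (assumed format).
    for out, op, operands in netlist:
        vals = [sig[name] for name in operands]
        if op == "AND":
            sig[out] = vals[0] & vals[1]
        elif op == "OR":
            sig[out] = vals[0] | vals[1]
        elif op == "XOR":
            sig[out] = vals[0] ^ vals[1]
        elif op == "NOT":
            sig[out] = ~vals[0] & mask
        else:
            raise ValueError(f"unknown gate type: {op}")
    return sig


# Hypothetical example netlist: n2 = NOT(a AND b), n5 = (NOT a) OR (NOT b).
netlist = [
    ("n1", "AND", ["a", "b"]),
    ("n2", "NOT", ["n1"]),
    ("n3", "NOT", ["a"]),
    ("n4", "NOT", ["b"]),
    ("n5", "OR", ["n3", "n4"]),
]
signatures = simulate_signatures(netlist, ["a", "b"])
```

Here n2 and n5 implement the same function by De Morgan's law, so their signatures agree for any set of patterns. In general, matching signatures only suggest equivalence and still require a formal proof, whereas differing signatures prove that two nodes are distinct.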
Summary of related work. The effectiveness of logic simulation has been demon-
strated in terms of its ability to distinguish different nodes in a circuit [50, 59]. Conse-
quently, signatures can be used in both logic optimization and verification. With respect to
verification, the correctness of a design can be ascertained up to the abstraction by comparing
its output signature to the corresponding output of a functionally correct design, also
known as a golden model. Design optimizations are also enabled by signatures because
equivalent nodes in a circuit can be merged to simplify the design [59]. Furthermore, the
signature representation is amenable to simple transformations [18], that can generate new
signatures and that, in turn, can be mapped to logic optimizations on the actual design.
Key aspects and limitations of signature-based abstractions. When signatures are
used, optimization and verification are correct only with respect to the abstraction. A formal
proof mechanism is often required to verify the correctness of the abstraction. Formal
proof engines, such as SAT solvers, invariably have exponential worst-case runtimes. This
lack of scalability is particularly problematic as design complexity grows. Since formal
proof mechanisms are typically based on hard-to-parallelize algorithms, it is difficult to
efficiently utilize the resources offered by recent multi-core CPUs.
Generating high-quality signatures is paramount so as to avoid incorrect characterizations
and to minimize the number of invocations of expensive proof mechanisms. The
quality of a signature rests in its ability to capture both typical-case behaviors and important
corner-case behaviors, while being occasionally refined through formal techniques and
additional simulation patterns. In [59], signatures are refined to improve their distinguishing
capabilities in finding equivalent nodes in a design. Despite the efficiency of generating
signatures, ensuring their high quality in very large designs with complex hierarchical
components is a major challenge. In this scenario, if a signature abstraction is desired
for a component far from the primary inputs of a design, the limited controllability of
this component undermines the quality of the signature generated and its ability to expose
interesting behavior in that component. Furthermore, previous works do not consider a
node’s downstream logic information when characterizing its behavior with a signature,
and therefore fail to exploit logic flexibilities present in large designs.
Finally, a general methodology for performing design optimizations with signatures
has not yet been developed. As we show in this dissertation, signatures simplify the search
for logic transformations and thus facilitate powerful gate-level optimizations that involve
both logic and physical design aspects. Such optimizations are known as physical synthesis.
However, conventional strategies for synthesis are inadequate in exploiting the runtime
savings and optimization potential of signature-based synthesis. This dissertation presents
the first generalized solution achieving this goal.
1.5 Components of Our Simulation-based Framework
To enable complex transformations that can be applied late in the design flow to
achieve timing closure, we introduce a series of improvements to signature-based abstractions
that overcome previous limitations. The new elements of our simulation-based
framework, developed throughout the dissertation, not only result in better optimizations,
but improve the quality of verification efforts. We now outline our major contributions:
• A high-coverage verification engine for stimulating a component deep within a hierarchical
design while satisfying constraints. Our strategy relies on a simulation
engine for high performance, while improving the fidelity of signatures and verification
coverage.
• An efficient linear-time don’t-care analysis to extract potential flexibility in synthesizing
each node of a circuit and to enhance the corresponding signatures.
• A technique to improve the efficiency of the formal proof mechanism in verifying
the equivalence between a design and its abstraction enhanced by don’t-cares.
• A strategy to improve the efficiency of verifying abstractions by exploiting parallel
computing resources such as the increasingly prevalent multi-core systems. Harnessing
these parallel resources is one mechanism to partially counteract the increasing
costs of verification and to enable deployment of our signature-based optimizations
on more complex designs.
• A goal-driven synthesis strategy that quickly evaluates different logic implementa-
tions leveraging signatures.
• A constraint-guided synthesis algorithm using signatures to improve physical performance
metrics, such as timing.
1.6 Organization of the Dissertation
In this dissertation, we introduce several algorithmic components that enhance our
signature-based abstraction. We then leverage these components in logic and physical
synthesis to enable powerful optimizations, where traditional techniques perform poorly.
Throughout the chapters of this dissertation, we gradually extend the scope and power of
signature-based optimizations. In Part II, we propose techniques that enhance signatures
by generating better simulation vectors that activate parts of the circuit in a design and
by encoding information on logic flexibility in the signatures. In Part III, we develop
verification strategies that mitigate the runtime costs of verifying the correctness of the
abstraction. In Part IV, we utilize our enhanced signatures and verification strategies to
enable design optimizations by manipulating these signatures. The rest of the dissertation
is structured as follows:
• For the remainder of Part I, we provide background material necessary to navigate
through this dissertation.
– Chapter II covers background in logic synthesis, verification, and logic simulation.
We outline recently-discovered synergies between these tasks and explain
how this dissertation builds upon these synergies.
– Chapter III describes the evolution of the design flow to address timing closure
and surveys previous work in late design flow optimization.
• In Part II, we introduce strategies to improve the quality and strength of signatures.
– Chapter IV introduces the notion of entropy for identifying parts of a design
that experience low simulation coverage. We then develop a strategy for improving
simulation of these regions using a SAT solver. This approach is useful
for stimulating internal components in complex hierarchies. In particular,
it helps in exposing bugs in corner-case behaviors.
– Chapter V introduces a technique for the efficient extraction of global circuit
don’t-cares based on a linear-time analysis and encodes them in signatures.
• Part III introduces strategies to counteract the complexity of verifying signature-
based abstractions in increasingly large designs.
– Chapter VI describes an incremental approach to verify a given abstraction up
to the derived don’t-cares and to refine it by generating additional signatures.
– Chapter VII introduces techniques to address the growing complexity of formal
verification by exploiting the increasing availability of multi-core systems
which can execute several threads simultaneously. We develop a parallel-SAT
solving methodology that consists of a priority-based scheduler for handling
multiple problem instances of varying complexity in parallel and a lightweight
strategy for handling single instances of high complexity.
• Part IV introduces techniques for performing logic manipulations using signatures.
– Chapter VIII describes how signatures can be exploited to enable powerful
synthesis transformations. In particular, we show how node merging up to
don’t-cares can greatly simplify a circuit. Then, we introduce a new general
approach for performing logic synthesis using signatures.
– Chapter IX proposes a path-based resynthesis algorithm that finds and short-
ens critical paths with wire bypasses. We apply our path-based resynthesis
after placement, when better timing estimates are available, and we report significant
improvements, indicating that current design flows still leave many
optimization opportunities unexplored.
• The dissertation is concluded in Chapter X with a summary of contributions and an
outline of future research directions.
CHAPTER II
Synergies between Synthesis, Verification, and Functional
Simulation
In the traditional design flow for integrated circuits, logic synthesis and verification
play a critical role in ensuring that integrated circuit parts released to the market are func-
tionally correct and achieve the specified performance objectives. Logic synthesis gener-
ates circuit netlists and transforms them to improve area and delay characteristics. These
transformations are carried forward by software used by circuit designers. To ensure the
correctness of these transformations, along with custom-made optimizations, verification
is typically performed over multiple steps in the design flow, where the actual behavior
of the circuit is verified against the desired behavior. Traditionally, synthesis and verification
are considered separate and independent tasks; however, recent research [59, 63, 95]
has exposed a number of common traits and synergies between synthesis and verification.
Functional verification often involves algorithms whose worst-case runtime complexity
grows exponentially with design size. However, the design size can be reduced through
synthesis optimizations, typically reducing the verification effort. Logic simulation has
also been employed to improve verification, and more recently, to enable netlist simplifi-
cations [95].
A major contribution of this dissertation is its in-depth exploration of synergies between
synthesis and verification, as well as the gains that can be derived by integrating
the two tasks through simulation techniques. Our goal is to improve the quality of re-
sults and the scalability of both practices, which are continually challenged by increasing
design complexity. In particular, we introduce speculative transformations that require
verification, a major departure from traditional correct-by-synthesis techniques, typically
employed today. In the remainder of this chapter, we discuss previous work in verification,
synthesis and logic simulation, focusing on strategies to improve their scalability.
2.1 Scalable Verification
Verifying the functional correctness of a design is a critical aspect of the design flow.
Typically, comprehensive verification methodologies [96, 7] are employed and require a
team of specialized verification engineers to construct test cases that exercise the functionality
of the circuit. The output of this circuit is usually compared against an ideal golden
model. To reduce the demands on the verification engineer in exposing interesting design
behavior through test cases, input stimuli can be automatically refined or modified, leading
to improvement in the verification coverage. For example, at the instruction level, Markov
models can be used [83] to produce instruction sequences that effectively stimulate certain
parts of the design. However, explicit monitors are necessary to guide this refinement, thus
still requiring detailed understanding of the design. At the gate-level, simulation can also
be refined [59] to help distinguish nodes, but this is primarily useful for equivalence check-
ing. The goal of all these procedures is to generate test cases that can expose corner-case
behavior in the circuit. In Chapter IV, we discuss how simulation coverage is improved
automatically, without requiring any detailed understanding of the design.
Because generating exhaustive test cases is infeasible and releasing a buggy design
is undesirable, formal verification techniques can be used to achieve higher verification
coverage. However, the limited scalability of formal techniques is a major bottleneck in
handling increasingly complex designs. Therefore, a combination of intelligently chosen
test suites and formal techniques on small components is often adopted to maximize
verification coverage.
One prominent formal proof mechanism particularly relevant to this work is equivalence
checking. In equivalence checking, the output response of a design is compared
against a golden model for all legal input combinations. If the response is always the
same, the designs are said to be equivalent. Often, binary decision diagrams (BDDs) can
be used to check the equivalence between two combinational circuits. A BDD [14] is
a data structure that can often efficiently represent a circuit in a canonical way, so that
checking equivalence means building this canonical form for both designs. However, the
number of nodes in a BDD can be exponential with respect to the number of inputs, thus
limiting the scalability of the approach. Satisfiability-based equivalence checking techniques
have been developed [13] as an alternative to BDDs. Despite having exponential
worst-case runtime, SAT-based techniques typically have lower memory requirements and
successfully extend to larger designs. Below we provide the background on satisfiability
necessary to navigate this dissertation. Then we examine previous attempts to scale the
performance of SAT solvers by exploiting multiple processing units concurrently.
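The idea of comparing a design against a golden model for all input combinations can be sketched with a brute-force check in Python. This is only illustrative: it enumerates every input vector, which is exponential in the number of inputs, whereas the BDD- and SAT-based techniques discussed above avoid explicit enumeration. The functions design, golden, and buggy are hypothetical examples, and miter_counterexample is a name introduced for this sketch.

```python
from itertools import product


def miter_counterexample(f, g, num_inputs):
    """Return an input vector on which f and g differ, or None if they are equivalent."""
    for bits in product([0, 1], repeat=num_inputs):
        # The XOR of the two outputs is 1 exactly when the circuits disagree.
        if f(*bits) ^ g(*bits):
            return bits
    return None


def design(a, b, c):
    return (a & b) | (a & c)


def golden(a, b, c):
    return a & (b | c)


def buggy(a, b, c):
    return a & b & c


equivalent = miter_counterexample(design, golden, 3) is None
counterexample = miter_counterexample(buggy, golden, 3)
```

A returned vector is a concrete witness of a mismatch that can be replayed in simulation, while a result of None establishes equivalence over all inputs of this small example.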
2.1.1 Satisfiability
The SAT problem entails choosing an assignment V for a set of variables that satisfies
a Boolean equation or discovering that no such assignment exists [76]. The Boolean
equation is expressed in conjunctive normal form (CNF), F = (a + b′ + c)(d′ + e)..., that is, a
conjunction of clauses, where a clause is a disjunction of literals. A literal is a Boolean
variable or its complement. For instance, (a + b′ + c) and (d′ + e) are clauses, and a, b′, c,
d′, e are literals.
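As a small illustration, the two example clauses can be encoded in Python using signed integers for literals, a convention borrowed from the DIMACS CNF format. The variable numbering (a = 1 through e = 5) and the helper names are assumptions made for this sketch.

```python
# Assumed numbering: a=1, b=2, c=3, d=4, e=5. A negative integer is a complemented literal.
cnf = [[1, -2, 3], [-4, 5]]   # (a + b' + c)(d' + e)


def literal_true(lit, assignment):
    """A literal is true when its variable's value matches the literal's polarity."""
    return assignment[abs(lit)] == (lit > 0)


def satisfies(cnf, assignment):
    """A CNF formula is satisfied when every clause has at least one true literal."""
    return all(any(literal_true(lit, assignment) for lit in clause) for clause in cnf)


all_false = {1: False, 2: False, 3: False, 4: False, 5: False}
b_only = {1: False, 2: True, 3: False, 4: False, 5: False}

sat_all_false = satisfies(cnf, all_false)   # b' and d' are both true
sat_b_only = satisfies(cnf, b_only)         # first clause has no true literal
```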
A Framework for Solving SAT
A common approach to solving SAT is based on the branch-and-backtrack DPLL algorithm
[24]. Several innovations, such as non-chronological backtracking, conflict-driven
learning, and decision heuristics greatly improve upon this approach [65, 80, 88]. The es-














Figure 2.1: Pseudo-code of the search procedure used in DPLL-SAT. The procedure terminates
when it either finds a satisfying assignment or proves that no such solution
exists.
The search() function explores the decision tree until a satisfying assignment is
found or the entire solution space is traversed without finding any such assignment. The
decide() function selects the next variable for which a value is chosen. Many methods
exist for selecting this “decision” variable, such as the VSIDS (Variable State Independent
Decaying Sum) algorithm developed in Chaff [65]. VSIDS involves associating an
activity counter with each literal. Whenever a new learnt clause is generated (see below)
from conflict analysis, the counter of each literal in that clause is incremented while all
other variables undergo a small decrease. The propagate() function performs Boolean
Constraint Propagation (BCP), i.e., it identifies the clauses that are still unsatisfied and for
which only one literal is still unassigned, and then assigns the literal to the only value
that can satisfy the clause. If the decision assignment implies a contradiction or conflict,
the analyze_conflict() function produces a learnt clause, which records the cause
of the conflict to prevent the same conflicting sets of assignments. The backtrack()
function undoes the earlier assignments that contributed to the conflict. Periodically, the
search() function is restarted: all current assignments are undone, and the search process
starts anew using a random factor in restarting the decision process so that different
parts of the search space are explored. Extensive empirical data shows the effectiveness
of restarts in boosting the performance of SAT solvers by minimizing the exploration in
computation-intensive search paths [9, 65].
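The roles of search(), decide(), and propagate() can be sketched in Python as follows. This is a simplified recursive variant written for illustration, not a reproduction of Figure 2.1: it omits conflict analysis, learnt clauses, VSIDS, and restarts, it backtracks chronologically by returning from the recursion, and decide() simply picks the first unassigned variable. Clauses are assumed to be lists of signed integers.

```python
def propagate(cnf, assignment):
    """Unit propagation. Returns False on a conflict, True otherwise."""
    changed = True
    while changed:
        changed = False
        for clause in cnf:
            unassigned = []
            satisfied = False
            for lit in clause:
                var = abs(lit)
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == (lit > 0):
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return False              # every literal is false: conflict
            if len(unassigned) == 1:
                lit = unassigned[0]       # unit clause: the value is forced
                assignment[abs(lit)] = lit > 0
                changed = True
    return True


def decide(cnf, assignment):
    """Pick the first unassigned variable (a placeholder for a real heuristic)."""
    for clause in cnf:
        for lit in clause:
            if abs(lit) not in assignment:
                return abs(lit)
    return None


def search(cnf, assignment=None):
    """Return a satisfying assignment as a dict, or None if none exists."""
    assignment = dict(assignment or {})
    if not propagate(cnf, assignment):
        return None
    var = decide(cnf, assignment)
    if var is None:
        return assignment                 # all variables assigned without conflict
    for value in (True, False):
        result = search(cnf, {**assignment, var: value})
        if result is not None:
            return result
    return None                           # both branches failed: backtrack


example_cnf = [[1, -2, 3], [-4, 5]]
model = search(example_cnf)
unsat = search([[1, 2], [-1, 2], [1, -2], [-1, -2]])
```

Copying the assignment at each recursive call keeps the sketch short; practical solvers instead maintain a single assignment trail and undo entries explicitly when backtracking.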
Learning
In this part, we consider two types of learning performed to reduce SAT search space:
preprocessing and conflict-driven learning.
Preprocessing. The goal of preprocessing a SAT instance is to simplify the instance
by adding implied constraints that reduce propagation costs, by eliminating variables, by
adding symmetry-breaking clauses, or by removing clauses that are subsumed by others.
Preprocessing has led to improved runtime in solving several instances, although the
computational effort sometimes outweighs the benefit. A recent preprocessor, SatELite
[28], achieves significant simplifications through the efficient implementations of variable
elimination and subsumption.
Conflict-driven learning. Dynamic learning is important to prevent repeated explo-
ration in similar parts of the search space. When a set of assignments results in a conflict,
the conflict analysis procedure in SAT determines the cause by analyzing a conflict graph.
In Figure 2.2, we show an example of a SAT instance and a set of assignments that result in
a conflicting assignment for variable v. Each decision (f = 1, g = 1, and a = 1) is depicted
by a different leaf node, and implications of these decisions are shown as internal nodes
in the graph. An implication occurs when a set of variable decisions forces an unassigned
variable to be assigned 0 or 1. A decision level is associated with each node (nodes at
the same level are denoted by the same color in Figure 2.2), which is the set of variable
assignments implied by a decision. For instance, the second decision level consists of the
second decision (g = 1), and the implications k = 0 and m = 0.
A learnt clause can be derived in a conflict graph from a cut that crosses every path from
the leaf decision values to the conflict exactly once. The nodes to the left of the cut are on
the reason side and those on the right are on the conflict side. Cut 1 in the figure shows
the 1-UIP (first unique implication point) cut, i.e., the cut closest to the conflict side. In
this cut, the reason side contains one node of the last decision level (node a) that dominates
all other nodes at the same decision level and on the same (reason) side. The assignment
e = 1, f = 1, k = 0, m = 0 is determined to be in conflict and hence (e′ + f′ + k + m) can
be added to the original CNF to prevent this assignment in the future. Cut 2 indicates
the 2-UIP cut where the reason side contains one node in the previous decision level (level
2) that dominates every other node in that decision level. Here, the 2-UIP learnt clause
(e′ + f′ + g′) can be added.
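A learnt clause of this kind is the negation of a conflicting partial assignment. The following Python sketch, with hypothetical helper names, builds the clause (e′ + f′ + k + m) from a conflicting assignment and confirms by enumeration which combinations of values for e, f, k, and m the clause excludes. It illustrates only the clause construction, not the conflict-graph analysis that identifies the cut.

```python
from itertools import product


def blocking_clause(partial_assignment):
    """Return the clause, as {variable: polarity}, that is false exactly under the given assignment."""
    return {var: not value for var, value in partial_assignment.items()}


def clause_value(clause, assignment):
    """A clause is true when at least one variable takes the polarity listed for it."""
    return any(assignment[var] == polarity for var, polarity in clause.items())


conflict = {"e": True, "f": True, "k": False, "m": False}
learnt = blocking_clause(conflict)   # represents (e' + f' + k + m)

# Enumerate all 16 assignments and keep those that falsify the learnt clause.
falsifying = [
    dict(zip("efkm", bits))
    for bits in product([False, True], repeat=4)
    if not clause_value(learnt, dict(zip("efkm", bits)))
]
```

Only one of the sixteen combinations falsifies the clause, so adding it to the CNF removes that combination from the search space without excluding any other assignment.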
Figure 2.2: An example conflict graph that is the result of the last two clauses in the list
conflicting with the current assignment. We show two potential learnt clauses
that can be derived from the illustrated cuts. The dotted line closest to the
conflict represents the 1-UIP cut, and the other is the 2-UIP cut.
A learning strategy commonly employed adds only the 1-UIP learnt clause for every
conflict. Despite the possibility of using smaller learnt clauses that technically prune larger
parts of the search space, it has been empirically shown in [90] that 1-UIP learning is most
effective. In [27], 1-UIP was shown to be more effective at pruning the search space
because the corresponding backtracking more often satisfied learnt clauses generated by
other UIP cuts.
2.1.2 Previous Parallel SAT Approaches
To boost the performance of SAT solvers on increasingly prevalent parallel architectures,
parallel SAT solving strategies have explored coarse-grain and fine-grain parallelization.
Fine-grain parallelization strategies target Boolean Constraint Propagation (BCP),
the runtime bottleneck for most SAT solvers. In BCP, each variable assignment is checked
against all relevant clauses, and any implications are propagated. BCP can be parallelized
by dividing the clause database among n different solvers so that BCP computation time
of each solver is approximately 1/n of the original. Coarse-grain parallelization strategies typically
involve assigning a SAT solver to different parts of the search space.
Fine-grain parallelization. The performance of fine-grain parallelization depends on
the partitioning of clauses among the solvers, where an ideal partition ensures an even dis-
tribution of BCP costs while minimizing the implications that need to be communicated
between each solver. This strategy also requires low-latency inter-solver communication
to minimize contention for system locks on general microprocessors, which can degrade
runtime performance. Therefore, fine-grain parallelization has been examined on special-
ized architectures [91] that can minimize communication bottlenecks. In [2, 92], signif-
icant parallelization was achieved by mapping a SAT instance to an FPGA and allowing
BCP to evaluate several clauses simultaneously. However, the flexibility and scalability of
this approach is limited, since each instance needs to be compiled to the specific FPGA
architecture (a non-trivial task), and conflict-driven learning is difficult to implement effectively
in hardware because it requires dynamic data structures.
Coarse-grain parallelization. The runtime of an individual problem can also be im-
proved with parallel computation by using a solver portfolio [38], where multiple SAT
heuristics are executed in parallel and the fastest heuristic determines the runtime for the
problem. A solver portfolio also offers a way of countering the variability that backtrack-
style SAT solvers experience on many practical SAT instances [39]. Because one heuristic
may perform better than another on certain types of problems, one can reduce the risk of
choosing the wrong heuristic by running both. Although parallelization here consists of
running multiple versions of the same problem simultaneously, if the runtime difference
between these heuristics is significant, a solver portfolio can yield runtime improvements.
However, using a portfolio solver does not guarantee high resource utilization as each
heuristic may perform similarly on any given instance or one heuristic may dominate the
others. The primary limitation of solver portfolios is that there is no good mechanism to
coordinate the efforts of these heuristics and the randomness inherent to them. Other approaches
consider analyzing different parts of the search space in parallel [22, 55, 66, 89].
If the parts of the search space are disjoint, the solution to the problem can be determined
through the processing of these parts in isolation. However, in practice, the similarities
that often exist between different parts of the search space mean that redundant analysis
is performed across the different parts. To counter this, the authors in [55] develop an
approach to explore disjoint parts of the search space by relying on the shared memory of
multi-core systems to transfer learned information between them. The approach considers
dividing the problem instance using different variable assignments calledguiding paths, as
originally described in [89]. One major limitation of this type of search space partitioning
is that poor partitions can produce complex sub-problems with widely varying structure
and complexity.
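A simplified view of search space partitioning can be sketched in Python: a set of splitting variables is chosen, and every assignment to them defines one disjoint sub-problem. The helper names and the example CNF are assumptions for illustration, and the guiding paths of [89] are more general than the fixed-depth split shown here.

```python
from itertools import product


def guiding_paths(split_vars):
    """Enumerate all assignments to split_vars; each one defines a disjoint sub-problem."""
    return [
        dict(zip(split_vars, bits))
        for bits in product([False, True], repeat=len(split_vars))
    ]


def restrict(cnf, path):
    """Simplify a CNF (lists of signed integers) under the partial assignment in path."""
    reduced = []
    for clause in cnf:
        kept = []
        satisfied = False
        for lit in clause:
            var = abs(lit)
            if var in path:
                if path[var] == (lit > 0):
                    satisfied = True      # clause already true on this path
                    break
            else:
                kept.append(lit)          # literal still undecided
        if not satisfied:
            reduced.append(kept)          # an empty list marks an immediate conflict
    return reduced


cnf = [[1, -2, 3], [-1, 2], [-3, 4]]
subproblems = [(path, restrict(cnf, path)) for path in guiding_paths([1, 2])]
```

Even in this toy instance the sub-problems differ: the path with variable 1 true and variable 2 false contains an empty clause and is immediately unsatisfiable, while the path with variable 1 false and variable 2 true contains a unit clause that forces further implications.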
The benefits of learning between solvers working on different parts of the search
space in parallel suggest potential super-linear improvement. However, the improvements
achieved by current strategies seem more consistent with the inherent variability of solving
many real-world SAT problems and the effect of randomization on reducing this variability.
Through clever randomization strategies, sequential solvers can often avoid complex
parts of the search space and outperform their parallel counterparts.
2.2 Scalable Logic Synthesis
Due to the development of powerful multi-level synthesis algorithms [74], scalable
logic synthesis tools have been able to optimize increasingly large designs since the early
1990s. During logic optimization, different multi-level optimization strategies are inter-
leaved and executed several times, including fast extraction (finding different decompo-
sitions of a node based on algebraic transformations) and node simplification (exploiting
circuit don’t-cares). These techniques are correct by construction, so that the resulting
netlist is functionally equivalent to the original assuming no implementation errors are
present in the synthesis tool. We outline several key aspects of improving the quality and
scalability of synthesis below.
2.2.1 Don’t Care Analysis
To enhance synthesis, circuit flexibility in terms of don’t-cares can be exploited. Figure
2.3 provides examples of satisfiability don’t-cares (SDCs) and observability don’t-cares
(ODCs). An SDC occurs when certain input combinations do not arise due to limited
controllability. For example, the combination of x = 1 and y = 0 cannot occur for the
circuit shown in Figure 2.3a. SDCs are implicitly handled when using SAT in validating
the netlist because SDC input combinations cannot occur for any satisfying assignment.
ODCs occur when the value of an internal node does not affect the outputs of the circuit
because of its limited observability [26]. In Figure 2.3b, when a = 0 and b = 0, the output
value of F is a don’t-care.
Figure 2.3: Satisfiability don’t-cares (SDCs) and observability don’t-cares (ODCs). a) An
example of an SDC. b) An example of an ODC.
Figure 2.4: ODCs are identified for an internal node a in a netlist by creating a modified
copy of the netlist where a is inverted and then constructing a miter for each
corresponding output. The set of inputs for which the miter evaluates to 1
corresponds to the care-set of that node.
Figure 2.4 shows a strategy for identifying ODCs for a node a. First, the circuit D is
copied, and a is inverted in the copy D∗. Then an XOR-based miter [13] is constructed
between the outputs of the two circuits. A miter is a single output function typically implemented
with XOR gates that compares the outputs of two circuits; functional equivalence
is established when the miter output is constant 0. The care set C(a) of node a can then be written as
C(a) = \bigcup_{i : D(X_i) \neq D^*(X_i)} X_i
where X_i is an input vector.
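A SAT solver can derive C by adding successive constraints called blocking clauses to
invalidate previously found satisfying assignments. The ODC set of a is then
ODC(a) = \bigcup_i X_i - C(a)
that is, the difference between the set of all input vectors and the care set.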
This approach can be computationally expensive and scales poorly, particularly when
the XORs are far from a. In [61], this pitfall is managed by examining only small windows
of logic surrounding each node being optimized. The don’t-cares extracted are used to
reduce the circuit’s literal counts. In [95], a very efficient methodology is developed to
merge nodes using local don’t-cares through simulation and SAT. The authors limit its
complexity by considering only a few levels of downstream logic for each node. However,
these techniques fail to efficiently discover don’t-cares resulting from logic beyond the
portion considered, a limitation that is directly addressed in this dissertation. Another
strategy to derive don’t-cares efficiently entails algorithms for computing compatibility
ODCs (CODCs) [71, 72]. However, CODCs are only a subset of ODCs, and fail to expose
certain types of don’t-cares; specifically, CODCs only enable optimizations of a node
which do not affect other node’s don’t-cares.
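The miter-based strategy of Figure 2.4 can be illustrated on a toy circuit. The following is a minimal sketch, not the implementation used in this work: it replaces the SAT solver with exhaustive enumeration, and the three-input circuit (internal node a = x AND y, output F = a OR z) and all function names are assumptions introduced only for illustration.

```python
from itertools import product

def outputs(x, y, z, invert_a=False):
    """Toy circuit (hypothetical): internal node a = x AND y, output F = a OR z."""
    a = x & y
    if invert_a:          # the modified copy D* inverts node a
        a ^= 1
    return (a | z,)       # tuple of primary outputs

def care_set(n_inputs=3):
    """Input vectors for which the XOR-based miter between D and D* evaluates to 1."""
    care = set()
    for vec in product((0, 1), repeat=n_inputs):
        d, d_star = outputs(*vec), outputs(*vec, invert_a=True)
        miter = any(o ^ o_star for o, o_star in zip(d, d_star))
        if miter:
            care.add(vec)
    return care

all_vectors = set(product((0, 1), repeat=3))
care = care_set()
odc = all_vectors - care   # ODC-set: all input vectors minus the care-set
print(sorted(odc))         # vectors with z = 1, where node a is unobservable
```

Enumeration is exponential in the number of inputs, so this only conveys the definition; a SAT solver with blocking clauses plays the role of the loop in practice.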
2.2.2 Logic Rewriting
Performing scalable logic optimization requires efficient netlist manipulation, typically
involving only a small set of gate primitives. Given a set of Boolean expressions
that describe a circuit, the goal of synthesis optimization is to minimize the number of
literals in the expressions along with the number of logic levels. Several drawbacks of
these techniques are discussed in [62], including limited scalability. To this end, an
efficient synthesis strategy called rewriting was introduced [62]. Logic rewriting is
performed over a netlist representation called an And-Inverter Graph (AIG) [47], where each
node represents an AND gate, while complemented (dotted) edges represent inverters. In
logic rewriting, the quality of different functionally-equivalent implementations for a small
logic block in a circuit is assessed. For example, in Figure 2.5 the transformation on the
left leads to an area reduction. Moreover, by using a technique called structural hashing
[47], nodes in other parts of the circuit can be reused. For instance, in the example on the
right, there is a global reduction in area by reusing gate outputs already available in other
parts of the circuit. In [63], logic rewriting resulted in circuit simplification and was used
to improve the efficiency of combinational equivalence checking.
Figure 2.5: Two examples of AIG rewriting. In the first example, rewriting results in a
subgraph with fewer nodes than the original. Through structural hashing, external
nodes are reused to reduce the size of the subgraph, as shown in the second
example.
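The reuse of existing nodes through structural hashing can be sketched as follows. This is an illustrative data structure under our own assumptions (the class name, the literal encoding 2·node + complement bit, and the method names are hypothetical), not the representation used in [47] or [62].

```python
class AIG:
    """Minimal And-Inverter Graph with structural hashing (illustrative only)."""

    def __init__(self):
        self.num_nodes = 0   # node 0 is reserved for constant false
        self.strash = {}     # (lit0, lit1) -> literal of an existing AND node

    def new_input(self):
        self.num_nodes += 1
        return 2 * self.num_nodes            # positive literal of the new node

    @staticmethod
    def invert(lit):
        return lit ^ 1                       # complemented edge: flip the low bit

    def and_gate(self, lit0, lit1):
        if lit0 > lit1:                      # canonical fanin order
            lit0, lit1 = lit1, lit0
        if lit0 == 0 or lit0 == (lit1 ^ 1):  # 0 AND x, or x AND NOT x
            return 0
        if lit0 == 1 or lit0 == lit1:        # 1 AND x, or x AND x
            return lit1
        key = (lit0, lit1)
        if key not in self.strash:           # create a node only if none exists
            self.num_nodes += 1
            self.strash[key] = 2 * self.num_nodes
        return self.strash[key]

aig = AIG()
a, b = aig.new_input(), aig.new_input()
g1 = aig.and_gate(a, b)
g2 = aig.and_gate(b, a)                      # reuses the node created for g1
```

Because both requests map to the same hash key, the second call returns the existing node instead of adding a gate, which is the source of the global area reduction described above.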
2.2.3 Physically-aware Synthesis
Logic synthesis can be guided by metrics other than literal-count reduction. Although
detailed wire information is unavailable during logic synthesis, rough delay estimates can
be made by placing gates before synthesis optimization. Instead of reducing literals, one
favors delay-improving transformations [21]. However, delay estimation is becoming
increasingly inaccurate before detailed placement and routing, as the actual interconnect
routes become more significant with every technology node. This continuing trend
suggests the need to integrate new synthesis algorithms after placement and routing, rather
than optimize beforehand with inaccurate estimates, which can have undesirable
consequences for other performance metrics.
2.3 Logic Simulation and Bit Signatures
Logic simulation involves evaluating the design on many different input vectors. For
each simulation vector, the circuit output response can be determined by a linear
topological-order traversal through the circuit. In our work, we exploit a type of information
known as a signature [50], which can be associated with each node of the circuit and is
computed through simulation.
A given node F in a Boolean network can be characterized by its signature S_F for
K input vectors X_1, ..., X_K.
Definition 2.3.1 S_F = {F(X_1), ..., F(X_K)} where F(X_i) ∈ {0,1} indicates the output of F
for input vector X_i.
Vectors X_i can be generated at random and used in bit-parallel simulation [1] to compute
a signature for each node in the circuit. For a network with N nodes, the time
complexity of generating signatures for K input vectors for the whole network is O(NK).
Nodes can be distinguished by the following implication: S_F ≠ S_G ⇒ F ≠ G. Therefore,
equivalent signatures can be used to efficiently identify potential node equivalences in a
circuit by deriving a hash index for each signature [50]. Since S_F = S_G does not imply
that F = G, this potential equivalence must be verified, e.g., using SAT. In [59],
simulation was used to merge circuit nodes while incrementally building a mitered circuit. The
resulting mitered circuit is much smaller and is typically easier to formally verify since the
corresponding SAT problem has fewer clauses, and it is thus often easier to solve.
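Signature computation and hashing can be sketched in a few lines. The sketch below is illustrative only: the netlist format, the gate names, and the fixed seed are assumptions, and each signature is stored as a K-bit integer so that one bitwise operation simulates all K vectors at once.

```python
import random
from collections import defaultdict

K = 64                      # number of simulation vectors (one bit each)
MASK = (1 << K) - 1

def simulate(netlist, input_names, seed=0):
    """Bit-parallel simulation; netlist entries are (output, op, fanins) in topological order."""
    rng = random.Random(seed)
    sig = {name: rng.getrandbits(K) for name in input_names}
    for out, op, fanins in netlist:
        a = sig[fanins[0]]
        if op == "NOT":
            sig[out] = ~a & MASK
        else:
            b = sig[fanins[1]]
            sig[out] = {"AND": a & b, "OR": a | b, "XOR": a ^ b}[op]
    return sig

def candidate_classes(sig):
    """Group nodes with identical signatures; each group is only a candidate equivalence."""
    buckets = defaultdict(list)
    for name, s in sig.items():
        buckets[s].append(name)
    return [names for names in buckets.values() if len(names) > 1]

netlist = [
    ("n1", "AND", ("a", "b")),
    ("n2", "NOT", ("a",)),
    ("n3", "NOT", ("b",)),
    ("n4", "OR", ("n2", "n3")),
    ("n5", "NOT", ("n4",)),   # by De Morgan's law, n5 is equivalent to n1
]
sig = simulate(netlist, ["a", "b"])
classes = candidate_classes(sig)
```

Nodes n1 and n5 land in the same bucket for any choice of vectors; other nodes sharing a bucket would still need to be confirmed or refuted, e.g., with SAT.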
2.4 Summary
Improving algorithms for logic synthesis and verification is an important, difficult and
multi-faceted challenge. Effective and scalable solutions may significantly impact elec-
tronic design automation and consumer electronics industries. In developing such solu-
tions, we leverage logic simulation used in conjunction with synthesis and verification to
improve scalability. While some research has shown the benefits of using simulation to
boost the performance of combinational equivalence checking, using signatures to guide
synthesis optimizations has only been considered in a few, limited forms. In our work, we




Challenges to Achieving Design Closure
Achieving timing closure is becoming more difficult because of the increasing
significance of interconnect delay. When gate delay was the primary component of chip delay,
logic synthesis tools could accurately estimate and improve delay by reducing the
maximum number of logic levels in a circuit. However, the resistance and capacitance of wires
have increased, preventing interconnect delay from scaling as well as gate delay. Figure
3.1 shows a plot from the 2005 report of the International Technology Roadmap for
Semiconductors (ITRS) indicating that gate delay is foreseen to decrease for future technology
nodes faster than local interconnect and buffered global interconnect delay.
Because of the increasing significance of interconnect delay, designs that are optimized
by traditional logic synthesis strategies often contain delay violations that are discovered
only after accurate interconnect information is available toward the end of the design flow.
Therefore, optimization is typically confined to physical transformations, as opposed to
logic or high-level design optimizations. Performing higher-level optimizations is often
problematic at the late design stages because they could perturb cell placement and cause
modifications that would violate other performance metrics and constraints. Furthermore,
there are fewer opportunities to perform logic optimizations with new timing information
late in the design flow, because the design has already been optimized thoroughly based
on earlier incomplete and inaccurate information. Therefore, more restrictive physical
synthesis techniques are used after placement, including interconnect buffering [56], gate
sizing [45], and cell relocation [3], which can all improve circuit delay without
significantly affecting other constraints. When timing violations cannot be fixed with these
localized techniques, a new design iteration is required, and the work of previous
optimization stages must be redone. In many cases, several time-consuming iterations are
required, and even then, the finished product may not achieve all the desired objectives.
Figure 3.1: Delay trends from ITRS 2005. As we approach the 32nm technology node,
global and local interconnect delay become more significant compared to gate
delay.
Achieving timing closure without undergoing many design iterations is a pervasive
problem studied in electronic design automation, and the leading cause of growing times to
market in the semiconductor industry. For this reason, we devote much effort to this issue
in the dissertation. In this chapter, we highlight previous efforts dedicated to improving
timing closure, as well as their shortcomings. We first describe ideas to improve the quality
of physical synthesis; then we discuss how the design flow has been transforming over
time from several discrete steps into a more integrated strategy that can better address
interconnect scaling challenges. Finally, we conclude by outlining our strategy to tackle
the timing closure challenge while overcoming the limitations of previous work.
3.1 Physical Synthesis
Physical synthesis is the optimization of physical characteristics, such as delay, using
logic transformations, buffer insertion, wire tapering, gate sizing, and logic replication.
Static timing analysis (STA) is used to estimate the delay guiding the physical synthesis
optimizations. The accuracy of timing analysis is dependent on the delay model considered
and the wire information available. For instance, the Elmore delay model [30] is a well-
established approximation for delay based on cell locations. The following equation gives






Notice that delay increases quadratically as a function of wire length. The Elmore delay
model is commonly used, but tends to overestimate delay for long nets. Other delay mod-
els, such as D2M [6], are more accurate because they use multiple delay moments, but are
also more complex.
Physical optimizations can produce a placement where cell locations overlap.
Overlapping cells can be eliminated through placement legalization. Effective legalization
strategies must produce a placement with no cell overlaps, while not significantly disrupting the
bulk of the already legal placement. Even small changes in the placement can substantially
alter circuit timing. Therefore, after legalization, incremental STA is performed to assess
the quality of the optimization. Evidently, an optimization that produces large changes in
the netlist is undesirable because the legalization procedure could be more disruptive to
the pre-existing placement than the benefit brought by the optimization.
Figure 3.2: Topology construction and buffer assignment [43]. Part a) shows the initial
topology and part b) shows an embedding and buffer assignment for that topology
that accounts for the time criticality of b. In part c), a better topology is
considered whose embedding and buffer assignment improves the delay for b.
The primary optimization strategies used in physical synthesis include reducing
capacitive load to improve the delay of long wires. Unbuffered long wires experience a quadratic
increase in delay as a function of wirelength, as shown in the delay model of Equation
3.1. However, by optimally placing buffers along a wire, the delay can be made linear
[67]. Moreover, buffers have been successfully deployed to improve delay for cells that
have large fanout. For example, given a source signal that fans out to several nodes,
optimal buffering involves both 1) constructing an interconnect topology that maintains the
connections between the source and the fanout and 2) assigning buffers to the topology so
that timing is optimized.
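The quadratic growth of unbuffered wire delay and the linearizing effect of buffering can be checked numerically. The sketch below assumes the textbook Elmore expression for a uniformly distributed RC wire with a lumped driver resistance and load capacitance, which may differ in form from Equation 3.1; the function names and all parameter values are hypothetical and unitless.

```python
def elmore_wire_delay(length, r, c, r_drv=0.0, c_load=0.0):
    """Elmore delay of a uniform wire: r, c are resistance and capacitance per unit length."""
    wire_cap = c * length
    return r_drv * (wire_cap + c_load) + r * length * (wire_cap / 2 + c_load)

def buffered_delay(length, n_seg, r, c, r_buf, c_buf, t_buf):
    """Delay of a wire split into n_seg equal segments, each driven by an identical buffer."""
    seg = length / n_seg
    per_stage = t_buf + elmore_wire_delay(seg, r, c, r_drv=r_buf, c_load=c_buf)
    return n_seg * per_stage

# Unbuffered: doubling the length quadruples the delay.
d4 = elmore_wire_delay(4.0, r=1.0, c=1.0)
d8 = elmore_wire_delay(8.0, r=1.0, c=1.0)

# Buffered with a fixed segment length: doubling the length doubles the delay.
b4 = buffered_delay(4.0, 4, r=1.0, c=1.0, r_buf=1.0, c_buf=1.0, t_buf=1.0)
b8 = buffered_delay(8.0, 8, r=1.0, c=1.0, r_buf=1.0, c_buf=1.0, t_buf=1.0)
```

The sketch uses evenly spaced, identical buffers purely for illustration; choosing buffer locations and sizes optimally is the subject of the buffering algorithms cited in this section.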
In Figure 3.2, we show an example of a netlist portion where a source signal s fans
out to signals a, b, and c. In part a), we show the original topology connecting the source
signal with its fanout. The topology indicates how fanout nodes are partitioned in the
interconnect. For instance, a and b are grouped together in part a). Finding a fanout
embedding for this topology corresponds to determining how the interconnect is routed
given the topology. One of the problems related to buffering is that of determining the
actual buffer locations once the topology is fixed (buffer assignment). In [82], the authors
develop an efficient algorithm for placing buffers into a given interconnect topology to
optimize delay. Their approach takes into account the required arrival time at each
of the fanout nodes in placing the buffers. If fanout b is timing critical, their solution
applied to the topology of Figure 3.2a would produce the buffer assignment shown in
Figure 3.2b where there is a short unbuffered path to fanout b. More recently, the authors of
[43] considered approaches for finding an optimal topology in addition to optimal fanout
embedding and buffer assignment. For instance, if fanout b is timing critical, a better
topology can be constructed as shown in Figure 3.2c, where the capacitive load of signal s
is reduced and the arrival time at fanout b is still earlier than a and c.
Fanin embedding, studied in [42], is the process of finding optimal cell locations for a
single-output subcircuit, also for the purpose of improving delay. This is conceptually
similar to the fanout embedding described above, where a routed interconnect tree is derived
from a given topology. In fanin embedding, a fanin tree consists of a single output signal,
input signals, and internal nodes representing the subcircuit’s logic cells. The topology of
the subcircuit is determined by the logic implementation of the output. The goal of fanin
embedding is to find a placement that optimizes delay, while ensuring that no cell overlaps
occur. However, because of logic reconvergence, most subcircuits are not trees in practice.
To address this issue, the authors of [42] explore a procedure that uses replication to
convert any subcircuit into a tree. Unlike the work in [43] that considers fanout embedding on
different topologies, fanin embedding has a limited range of solutions because the topology
is fixed by the logic implementation.
Figure 3.3: Logic restructuring. The routing of signal a with late arrival time shown in
part a) can be optimized by connecting a to a substitute signal with earlier
arrival time as in part b). In this example, the output of the gate AND(b,c) is a
resynthesis of a.
So far, in this section we have analyzed physical synthesis strategies that involve
buffering, finding optimal cell locations, and cell replication. We now discuss logic
restructuring techniques. Logic restructuring comprises netlist transformations that preserve the
functionality of the circuit. For example, in [18, 53, 78] logic restructuring is performed
by using only simple signal substitutions, thus missing many other restructuring
opportunities. The authors in [84] develop an approach to identify restructuring opportunities
involving more types of logic transformations, but global don’t-cares are not exploited.
For example, consider Figure 3.3, where signal a can be replaced by using the output of
the AND(b,c) gate, producing a small change to the netlist that hopefully has minimal
impact on the existing placement. The work in [18] uses simulation to quickly identify
potential substitutions, but does not explore the full range of potential transformations; for
instance, observability don’t-cares are not taken into consideration. In [53], redundancy
addition and removal (RAR) techniques are proposed to improve circuit timing. Our
empirical analysis of RAR in Chapter IX shows that these techniques leave significant room
for improvement because the transformations they consider are somewhat limited.
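The substitution of Figure 3.3 can be screened with simulation before any formal check is attempted. The following sketch is a simplified illustration under our own assumptions, not the procedure of [18] or [53]: signatures are given as 8-bit integers with hypothetical values, only two-input AND, OR, and XOR gates are tried, and every reported candidate would still have to be verified, e.g., with SAT.

```python
from itertools import combinations

OPS = {
    "AND": lambda p, q: p & q,
    "OR": lambda p, q: p | q,
    "XOR": lambda p, q: p ^ q,
}

def candidate_substitutions(target, signatures):
    """Return (op, u, v) triples whose simulated output matches the target's signature."""
    wanted = signatures[target]
    others = sorted(name for name in signatures if name != target)
    found = []
    for u, v in combinations(others, 2):
        for op_name, op in OPS.items():
            if op(signatures[u], signatures[v]) == wanted:
                found.append((op_name, u, v))
    return found

# Hypothetical 8-vector signatures; a happens to match AND(b, c).
signatures = {
    "a": 0b10001000,
    "b": 0b11001100,
    "c": 0b10101010,
    "d": 0b11110000,
}
candidates = candidate_substitutions("a", signatures)
print(candidates)   # [('AND', 'b', 'c')]
```

A matching signature only shows agreement on the simulated vectors, so a candidate is a hypothesis rather than a proof of equivalence.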
Examining a broader set of transformations is challenging for current synthesis strate-
gies. In a post-placement environment, a netlist has already been mapped to a standard-cell
library and restructuring must be confined to a small section of logic, so that placement
is not greatly disrupted. If this small section were restructured using rewriting [62] (as
explained in Section 2.2.2), logic sharing would be difficult with AIG-based structural
hashing [47] since the netlist may be mapped to something considerably different from
an AIG. The solution we propose in Chapter IX overcomes this limitation by integrating
logic sharing on an arbitrarily mapped netlist with aggressive restructuring.
3.2 Advances in Integrated Circuit Design
Figure 3.4a illustrates the traditional division of design stages outlined in Chapter I. In
this flow each stage is invoked only once, and design closure is expected to be achieved at
the end of the flow. In practice, extensive physical verification is required to assess whether
design goals are achieved and whether all constraints are satisfied. Often, design closure
fails in several ways. For example, logic synthesis fails if the total area of the synthesized
netlist is greater than specified. The result is a re-iteration of synthesis, perhaps with a
modified objective function. In a more dramatic scenario, the design might need to be
re-optimized at a higher level of abstraction. As another example, if routing resources are
insufficient for a wire, an iteration of the placement phase could reduce congestion.
To avoid costly iterations of several of the design stages, this traditional design flow
comprising several discrete optimization stages has evolved into a more integrated strategy
where the boundary between synthesis and place-and-route is blurred. The goal of design
integration is to perform optimizations that are simultaneously aware of many objectives
and constraints, such as performing logic optimizations during synthesis to lower wire
congestion in the placed design. The evolution of the design flow is described in the
following paragraphs and is shown in Figure 3.4.
Figure 3.4: Evolution of the digital design flow to address design closure challenges due
to the increasing dominance of wire delay. a) Design flow with several discrete
steps. b) Improved design flow using physical synthesis and refined timing
estimates to achieve timing closure more reliably. c) Modern design flow where
logic and physical optimization stages are integrated to leverage better timing
estimates earlier in the flow.
Figure 3.4b illustrates a design flow incorporating physical synthesis and iterative
feedback to achieve timing closure. Initially, post-placement optimizations (physical
synthesis) were sufficient to manage increasing wire delay by improving delay of global
interconnects through buffer insertion. However, as wire delay has become more and more
prominent, wire-load models have been incorporated in logic synthesis to target gates with
large capacitive loads. Even though these models only consider a gate’s input capacitance,
inaccurate estimates and optimizations could be easily corrected at later stages through
gate sizing [45]. Eventually these approximate models became inadequate too, as wire
capacitance and resistance became even more significant, resulting in additional design
iterations and requiring better delay models.
To address the challenges posed by increased wire delay, an even more integrated
design flow is currently in use (shown in Figure 3.4c), where more physical information is
used at earlier stages of the design flow. For example, to improve wire delay, the authors
in [40] incorporate wirelength estimation in logic synthesis, so that the netlist generated
would likely have fewer wire detours. In [21, 41, 68], incremental placement is coupled
with logic synthesis to assess the quality of each synthesis transformation in terms of its
impact on placement (this approach is known as companion placement). The companion
placement can even be generated by attempting to place a netlist mapped to a simple gate
library before synthesis optimization. Another strategy integrates physical information
in early design stages to control the dominance of global interconnect (as forecasted in
Figure 3.1). This solution generates an RTL floorplan before logic optimization, so that
it can estimate timing for the roughly 10% of wires that mostly constitute global interconnect
[104].
3.3 Limitations of Current Industry Solutions
Even with better integration in the design flow, physical synthesis remains essential to
avoid delay violations and only grows in importance with every technology node. Physical
optimizations are limited because there is less opportunity for logic transformations at
late design stages, reducing opportunities for improvement. Furthermore, post-placement
optimizations must minimize the region affecting placement to contain hard-to-predict
impact to delay due to legalization.
There are several additional limitations to the current methodology [103, 104] that are
expected to worsen in future technology nodes:
1. Timing-driven transformations may fail to improve delay, but still negatively impact
area. This may occur due to inaccurate timing estimates. In addition, incorporating
timing models in traditional synthesis and technology-mapping increases computa-
tional effort.
2. Maintaining a companion placement during logic synthesis incurs computational
overhead and may be inadequate for future technologies where even more accurate
delay estimates are needed. First, generating placement for an un-optimized netlist
with more logic cells than those in the final layout unduly stresses placement
algorithms. Second, the accuracy is still limited because the companion placement
estimates cell locations approximately and cannot determine actual wire routes and
parasitic effects that can significantly affect delay.
3. The impact of poor optimization on shorter interconnect is becoming more
profound, and using common physical synthesis strategies, such as buffering, may be
insufficient. In [73], it was observed that future technologies will require buffers at
much smaller wire lengths. It was estimated that at the 45-nm node, 35% of cells
in a large synthesizable block would be buffers. This fraction is projected to
increase to 70% at the 32-nm node. Even if timing is maintained using buffering, the
consequences for area utilization and power consumption will be severe. Current
methodologies using higher-level estimates before physical synthesis will be unable
to account for the increasing relevance of shorter interconnects.
To achieve better design flows, intuition suggests incorporating synthesis optimization
after placement because at that point the synthesis process can utilize more accurate timing
information. This allows for more powerful timing optimizations since more detailed
estimates are available, while minimizing negative impact to other performance metrics.
However, traditional post-placement synthesis optimization is inadequate because it only
considers a small subset of possible transformations, and fails to exploit the full range
of possibilities (for instance, due to don’t-cares).
3.4 Contributions of the Dissertation
In this chapter, we have described the evolution of the design flow in the last few
decades to address the increasing dominance of interconnect. It is becoming more difficult
to provide accurate timing information to logic synthesis, and current physical synthesis
strategies are becoming inadequate at generating the delay improvement necessary to
reduce costly design iterations. Our work overcomes the restrictiveness of current physical
synthesis methodologies in improving interconnect delay. In summary, the major
contributions of this dissertation to achieve this goal are:
• A post-placement resynthesis strategy that tightly integrates accurate physical
constraints to improve critical path delay by constructing subcircuit topologies using
static timing analysis. Our strategy greatly exceeds the optimization capabilities of
traditional logic synthesis techniques and, at the same time, minimizes
perturbations to the placement.
• A novel metric for identifying sections of critical path in a netlist that are most
amenable to logic resynthesis.
• A comprehensive simulation-based framework that uses signatures to identify post-
placement optimizations in complex designs. The components of our framework
include:
– A solution that automatically identifies areas of a circuit inadequately
sensitized under random simulation and relies on a SAT-based technique to
generate new simulation vectors to target these areas. This improves the quality of
signatures, which enables more efficient identification of resynthesis
opportunities.
– A novel parallel SAT solver infrastructure that produces better utilization of
multi-core systems and therefore faster verification of synthesis optimizations
identified with signatures.
– A powerful synthesis approach that uses signature-based abstractions to quickly
identify functionally equivalent logic structures up to global circuit don’t-cares,
increasing the number of post-placement resynthesis opportunities.
Part II
Improving the Quality of Functional
Simulation
The effectiveness of bit signatures in enabling powerful physical optimizations is the
centerpiece of this dissertation. This effectiveness depends on the ability of signatures to 1)
distinguish circuit nodes that are functionally different and 2) identify potential logic
transformations that elude traditional logic synthesis techniques. In Chapter IV, we introduce
a strategy that improves the quality of the input stimuli for simulation so that it produces
high-quality signatures, i.e., signatures that better distinguish functionally different nodes
with just a few simulation cycles. Then in Chapter V, we propose an algorithm to incor-





As mentioned in Section 1.4, the advantage of using a bit signature to identify potential
logic optimizations lies in its ability to characterize a circuit node’s functionality with
only a partial truth table. To generate a signature, input vectors are applied to the inputs
of the circuit. The input vectors chosen determine the quality of the signature.
High-quality signatures are defined as those that require few input vectors to generate, where
the signatures’ values distinguish functionally different nodes. However, the usefulness
of signatures in guiding powerful design optimizations also depends on the efficiency of
generating these high-quality signatures. If prohibitive computation efforts are required,
the benefits of using signatures are negated.
Quickly generating high-quality input vectors is the challenge that this chapter
addresses. Traditionally, input vectors for verification are created by performing a mixture of
random simulation and guided simulation strategies. Guided simulation involves
choosing input vectors either manually or through an automated mechanism to achieve some
simulation coverage goal. The advantages of random simulation are 1) the speed at which
new input vectors can be generated and 2) the ability of these input vectors to often expose
scenarios that guided simulation fails to capture. However, random simulation does not
always produce high-quality signatures. In fact, two nodes may have similar functionality
(and truth tables), which often requires a large number of random input vectors to
distinguish them. Reducing the number of input vectors required to generate high-quality
signatures requires either manually deriving input vectors using the expertise of the
designer or automatically deriving input vectors using a proof engine like SAT. For instance,
a SAT solver can be used to derive an input vector that distinguishes two circuit nodes [59].
However, using either manual or automatically-guided simulation to distinguish nodes is
often time consuming compared to random simulation.
In complex designs, random simulation struggles to expose interesting behavior, as
many components of a circuit are deep in the circuit’s structure and thus difficult to
control from the primary inputs. In other words, applying different input vectors often does
not correspond to differences in the internal nodes of the component under analysis,
resulting in signatures that do not distinguish functionally different nodes. The use of
formal methods can improve the signature quality of a single node; however, many of the
logic transformations described in later chapters require good signatures for many internal
nodes, and making numerous invocations of formal engines undermines the computational
efficiency of using signatures as an abstraction.
In this chapter, we introduce an approach that produces high-quality input vectors by
leveraging the benefits of both random and guided simulation. This approach, called
Toggle, involves 1) identifying components in a design that are not controlled adequately by
random input vectors and 2) targeting each component with guided simulation leveraging
a SAT solver. Our guided simulation overcomes the limitations of previous methodology
by generating input vectors that target several circuit nodes simultaneously, rather than one
at a time. Furthermore, our approach is applicable to verifying the functional correctness
of a design, as we show a correlation between improving the quality of signatures and
improving the verification coverage in a design.
4.1 Improving Verification Coverage through Automated Constrained-Random Simulation
Constrained random simulation can be used to expose interesting behavior in a
design [96]. Constrained random simulation involves the addition of constraints that limit
and control the input combinations that are sent to the design. However, there are
major challenges involved in constrained-random simulation that we address in this
chapter. First, generating constraints suggests that the design team has a thorough knowledge
of internal aspects of the design. Second, generating specific input vectors that satisfy
given constraints requires the use of formal methods. This second challenge is partially
addressed by [86], where constraints are modeled as BDDs and simulation vectors are
obtained through a random walk of the BDD. However, the approach in [86] still requires
complex constraint specifications, and it is limited in the constraint complexity that it can
handle by its dependency on BDD size. Toggle overcomes these challenges by
automatically generating constraints that guide simulation without requiring detailed knowledge of
the design. Furthermore, these constraints are small in general and input vectors can be
efficiently derived using a novel SAT-based approach.
The high-level flow of Toggle is illustrated in Figure 4.1. First, we apply low-effort
synthesis to the behavioral specification of a design, so as to leverage gate-level tools.
Then, we apply random simulation vectors to the netlist. To analyze the toggle activity
of each signal in the netlist, we have introduced a novel entropy-based coverage metric.
Using an entropy calculation, we search for internal signals that suffer from low switching
activity, because they produce signatures that are less likely to be capable of distinguishing
functionally different nodes. These signals are used to guide a partitioning of the design,
so that signals experiencing low activity are grouped together. Low-activity partitions
are then targeted by our SAT-based guided simulation, as shown at the end of the flow.
By targeting several partitions with guided simulation, we seek to evenly sensitize the
different components of a design.
Figure 4.1: Our Toggle framework automatically identifies the components of a circuit
that are poorly stimulated by random simulation and generates input vectors
targeting them.
The key theoretical result that we leverage to increase the switching activity of a
partition is from [81]. It enables us to derive an even distribution of input vectors stimulating a
partition by automatically generating small random constraints that involve XORing sets
of inadequately stimulated signals. We then derive targeted stimuli by invoking a SAT
solver. Our technique is flexible in that it can evenly sensitize parts of the design while
incorporating other designer-specified constraints. We apply our analysis to commonly-used
benchmark designs and demonstrate that many of them experience very low toggle
coverage under random simulation. In contrast, our technique achieves higher simulation
coverage than random simulation and is orders of magnitude faster when compared to a
guided simulation approach.
The chapter is organized as follows. In Section 4.2, we propose a solution for mon-
itoring activity in a design based on entropy. In Section 4.3, we introduce a strategy to
re-simulate areas of a circuit to increase its toggling activity. Finally, experimental results
comparing Toggle to constrained-random simulation are shown in Section 4.4.
4.2 Finding Inactive Parts of a Circuit
In this section, we adapt the notion of Shannon’s entropy [75] to estimate simulation
coverage within a gate-level circuit and propose its use to find inadequately-stimulated
regions. We then show that obtaining high entropy corresponds to evenly sensitizing a
design and thus minimizes unintended simulation bias, which can help in exposing corner-
case behavior.
4.2.1 Toggle Activity of a Signal
The toggle coverage for a single signal s in a circuit C can be determined by the
distribution of 0s and 1s seen under input stimuli. Each 0 corresponds to a maxterm and
each 1 to a minterm of the function implementing s. We capture this distribution with two















where nOnes (nZeroes) is the number of simulation cycles for which s = 1 (s = 0) and K
is the total number of simulation cycles examined. $E_s$ assumes values ranging from 0 to
1, where higher entropy indicates a more even distribution of ones and zeroes. If s is the
output of a function depending on the set of Boolean variables X, we can relate $E_s$ to the






Based on Equation 4.2, if the input vectors applied to X are uniformly distributed (and
thus $E_{X_i} = 1$ for every input), the maximum value of $E_s$ is 1. For instance, an even
distribution of input vectors applied to an XOR function results in high $E_s$; in contrast, for
an AND function, $E_s$ is low because it has 1 minterm and $2^{|X|} - 1$ maxterms. If s fans out
to other parts of the circuit, the signal’s low entropy can be a limiting factor in achieving
high switching activity in downstream logic, as indicated by Equation 4.2. We observe
that the signal entropy can be increased by setting the signal to either 0 or 1 (depending
on which value occurs less frequently) and then deriving an input vector that satisfies this
condition with a SAT solver.
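The signal-entropy computation can be illustrated with a short Python sketch. It assumes that Equation 4.1 is the standard binary Shannon entropy over the counts nOnes and nZeroes, which is consistent with the stated range of 0 to 1; the function names are hypothetical and are not part of Toggle.

```python
import math


def signal_entropy(n_ones: int, n_zeroes: int) -> float:
    """Binary Shannon entropy of a signal from its simulation counts.

    Assumption: Equation 4.1 has the usual form
    -(nOnes/K) log2(nOnes/K) - (nZeroes/K) log2(nZeroes/K).
    """
    k = n_ones + n_zeroes  # K: total number of simulation cycles examined
    if k == 0:
        raise ValueError("at least one simulation cycle is required")
    entropy = 0.0
    for count in (n_ones, n_zeroes):
        if count:  # a value that never occurs contributes 0
            p = count / k
            entropy -= p * math.log2(p)
    return entropy


def and_gate_entropy(num_inputs: int) -> float:
    """Entropy of an AND output under exhaustive, uniform input stimuli."""
    # An AND gate has 1 minterm and 2^|X| - 1 maxterms.
    return signal_entropy(1, 2 ** num_inputs - 1)
```

Under these assumptions, a signal that is 1 in half of the cycles reaches the maximum entropy of 1.0, whereas the output of an 8-input AND gate stays below 0.04, which is why such a signal would be selected for guided re-simulation.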
As a practical example of guiding simulation based on signal entropy, consider the
impact of random simulation on an 8-bit bidirectional counter, as shown in Figure 4.2a.
The results indicate that after many simulation vectors, random stimuli do not adequately
toggle the most significant bit of the counter. We toggle the output bit of the counter
with the smallest entropy by deriving a sequence of counter operations that flip this value
using Toggle. Figure 4.2b shows that the techniques described in this work achieve an
even distribution of entropy across each counter bit after only 300 additional simulation
vectors, while the same result requires over 10000 vectors in a pure random simulation
environment.
Figure 4.2: The entropy of each bit for an 8-bit bidirectional counter after 100, 1000, and
10000 random simulation vectors are applied is shown in part a). Part b) shows
the entropy achieved after 100, 200, and 300 guided vectors are applied after
initially applying 100 random vectors.
4.2.2 Toggle Activity of Multiple Bits
We extend the notion of signal entropy for a single signal to a set of signals that
experience low activity when correlated to each other. We first identify these sets of signals
as small cuts in the circuit determined by automatic netlist partitioning. We then define
a coverage metric to assess the activity along the partition inputs that accounts for signal
correlation.
Automatic circuit partitioning. Circuit partitioning has been explored in physical
placement applications where net-cut minimization strategies generally lead to smaller
wirelength. The Fiduccia-Mattheyses (FM) min-cut partitioning algorithm [31] is commonly
used for circuit partitioning and runs in linear time on the netlist size per optimization
pass. Typically, only a few passes are required to achieve a good partition. Furthermore,
multi-level extensions of this algorithm scale near-linearly to very large designs.
In our Toggle flow, we only need to generate a partition of the circuit once, hence
its runtime is amortized by the subsequent input generation. To partition the circuit, we
perform recursive bisection, i.e., apply multiple cuts in the circuit until it is partitioned to
a desired granularity (specified by a designer). The goal of this procedure is to minimize
the total number of nets connecting the different partitions while ensuring that partitions
have similar sizes. This leads to the generation of large partitions with few input signals,
so that the activity for a large section of logic can be determined by only a small set of
controlling signals.
To identify input cuts that experience low activity, we use the signal entropy defined
in Equation 4.1 to guide the partitioning objective function. We note that the maximum
entropy $E_F$ of a set of signals F is bounded by $\sum_{s \in F} E_s$. After assigning $E_s$ as a weight to net
s, we can employ netlist partitioning to find cuts with small entropy. This creates partitions
with inputs whose total entropy is small, which, in turn, exposes parts of the circuit that
are inadequately sensitized.
Estimating cut activity and biasing through entropy. The activity along a derived cut
can be analyzed to assess the amount of simulation coverage for each partition. Consider
the following metric for cut activity on partitionF:
$$Ac_F = \mathit{num\_diff\_vecs}(\langle F_1, \cdots, F_m \rangle) \qquad (4.3)$$
where num_diff_vecs is the number of different value combinations that occur on partition
F’s m-input cut. We observe that this formula does not consider the frequency of certain
combinations; it only provides the number of different combinations that are simulated.
Consider the following Boolean function $F(g_n(X), h_m(Y))$ where $X \cap Y = \emptyset$ and $g_n$ and
$h_m$ are n-output and m-output functions, respectively. We analyze the activity along the
outputs of g and h using Equation 4.3. According to Equation 4.3, high activity would
occur if $Ac_g = 2^n$ (if all output combinations are possible), and likewise if $Ac_h = 2^m$.
Function F has n + m inputs. Assuming maximum activity along the outputs of g and h, the
maximum activity along F’s input cut would be $2^n 2^m$. Because the activity metric does
not account for the frequency of combinations, there is no insight on whether repetitive
value combinations occur frequently for $F(g_n, h_m)$. Avoiding repetitive combinations is
desirable so that the broadest span of behavior in the circuit is explored. However, we do
know that there are at least $\min(2^n, 2^m)$ different combinations.
We improve this coverage metric to account for repetitive value combinations, so that
we can better guide our re-simulation efforts. To account for this repetition, which we call
simulation bias, we develop a measure based on the amount of information (entropy)
associated with the signals along a partition’s inputs under K simulation vectors. We compute
the entropy of F as:








where occ is the number of occurrences of a particular vector vec representing a value
combination along the input cut, which is represented by an integer value. Under this
formulation, the entropy is high when there are several different value combinations.
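The difference between the activity count of Equation 4.3 and the entropy measure can be made concrete with a small Python sketch. It assumes that Equation 4.4 is the Shannon entropy of the observed distribution of cut vectors, i.e., a sum of -(occ/K) log2(occ/K) over the distinct vectors; this form and the function names are assumptions made for illustration.

```python
import math
from collections import Counter


def cut_activity(vectors):
    """Equation 4.3: number of different value combinations seen on the cut."""
    return len(set(vectors))


def cut_entropy(vectors):
    """Entropy of the cut under K simulation vectors (assumed form of Eq. 4.4).

    Each element of `vectors` is one value combination along the input cut,
    encoded as an integer.
    """
    k = len(vectors)
    if k == 0:
        return 0.0
    occ = Counter(vectors)  # occ[vec]: number of occurrences of vec
    return -sum((c / k) * math.log2(c / k) for c in occ.values())
```

For example, the sequences `[0, 1, 2, 3]` and `[0] * 13 + [1, 2, 3]` have the same activity of 4, but only the first reaches the entropy of 2 bits that corresponds to an even distribution over four combinations; the second is dominated by one repeated vector and scores lower.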
Using the entropy metric we can define even sensitization formally under entropy as:
Definition 4.2.1 A set of inputs X to function F is evenly sensitized if $\lim_{K \to +\infty} E^K_F = |X|$,
where $2^{|X|}$ is the number of possible input combinations along X.
When the number of possible input combinations is less than $2^{|X|}$, due to limited
controllability, the entropy corresponding to even sensitization is log2(the maximum number
of different value combinations). Because the number of input vectors K applied during
simulation is typically much smaller than the number of possible input combinations, a set of
inputs is evenly sensitized under K input vectors when $E^K_F \approx \log_2 K$.
Considering the Boolean function $F(g_n(X), h_m(Y))$, we can determine the maximum
entropy along the outputs of $g_n$ as $E^K_g = |X|$, where |X| is the number of inputs to g. In
other words, we see that the outputs of a function can be sensitized at best as evenly as its
inputs. For a reversible circuit, there is a one-to-one mapping between input and output
combinations, so that the entropy over the inputs is equal to the entropy of the outputs. The
entropy for the n + m inputs of F, $E^K_F$, can now be bounded as follows:
$$\min(E^K_g, E^K_h) \le E^K_F \le E^K_g + E^K_h \qquad (4.5)$$
Unlike the metric in Equation 4.3, by using entropy we can provide a bound to measure
how even the sensitization of F is. In other words, we encapsulate more information about
the behavior of the circuit by using entropy rather than simply counting the number of input
vectors (we later explain how to estimate the number of possible input vectors so that the
maximum possible entropy is known). There is an additional benefit to using the entropy
metric. By stimulating the partition with the smaller entropy, either $g_n$ or $h_m$, we increase
the lower bound in Equation 4.5 for downstream logic.
4.3 Targeted Re-simulation
Toggle uses the entropy measure previously described to find parts of the design with
low activity. We now introduce a SAT-based strategy that uses random XOR constraints to
produce an even distribution of simulation vectors along a partition cut with low activity.
The motivation for producing an even distribution is to find corner-case behavior that
could not be exposed previously without detailed knowledge of the design and the
generation of complex constraints.
Deriving a distribution of input vectors that evenly sensitize certain signals using a SAT
solver is challenging because state-of-the-art SAT solvers do not provide any guarantees
on the quality of the distribution. On the other hand, traversing a BDD to derive an even
distribution of input vectors as in [86] may require prohibitive amounts of memory to
represent the circuit. These challenges can be partially addressed by using techniques
developed in the AI community, where a SAT solver is modified to evenly sample the
solution space [25, 44]. However, these approaches are incompatible with DPLL-based
SAT solvers, which are often more effective in solving EDA instances. This limitation is
partially addressed in [36], which uses randomly generated XOR constraints to modify the
SAT instance so any SAT solver can sample its solution space more evenly. At first sight,
these techniques are not directly applicable to IC verification since we desire to derive
input vectors that expose corner-case behavior in a circuit, but our work provides several
missing links to make this connection.
4.3.1 Random Simulation with SAT
In this section, we first discuss the theoretical underpinnings that are used in our
strategy to evenly sample the SAT solution space. We then propose a strategy to improve
the sensitization quality of a set of signals in a circuit, while satisfying the circuit’s input
constraints.
Theoretical background. Consider a SAT instance withN > 1 solutions. According
to [81], it can be transformed to an instance that admits only one of those N solutions,
requiring only a randomized polynomial-time algorithm that adds a limited number of
XOR constraints. The algorithm succeeds in producing such an instance with probability
≥ 1/4. Below, we discuss an aspect of this result that is relevant to our work, that is,
adding a random XOR constraint reduces the solution space roughly by half with high
probability.
Assume that we are given a SAT instance f with variables $x_1, x_2, \ldots, x_n$, and with
solutions $v \in \{0,1\}^n$. To reduce the solution space, we randomly choose an assignment of the
variables $w_1 \in \{0,1\}^n$ and add the following constraint to f: $v \cdot w_1 = 0$ in base-2 arithmetic
(where $\cdot$ is the dot product). This can be expressed as follows:
$$f \wedge (x_{i_1} \oplus x_{i_2} \oplus \cdots \oplus x_{i_j} \oplus 1) \qquad (4.6)$$
where $i_1, \ldots, i_j$ are the indices of the variables $x_i$ for which $w_1$ is 1. This results in an XOR
constraint whereby an even number of the variables $x_{i_1}, \ldots, x_{i_j}$ selected by $w_1$ must be
assigned 1. Alternatively, a CNF representation can be given as:
$$f \wedge (y_1 \Leftrightarrow x_{i_1} \oplus x_{i_2}) \wedge (y_2 \Leftrightarrow y_1 \oplus x_{i_3}) \wedge \cdots \wedge (y_{j-1} \Leftrightarrow y_{j-2} \oplus x_{i_j}) \wedge (y_{j-1} \oplus 1) \qquad (4.7)$$
where the $y_j$ variables are additional auxiliary variables required to express the XOR
constraint.
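The chained encoding of Equation 4.7 can be sketched in Python as follows. The sketch expands each equivalence y ⇔ a ⊕ b into four clauses and ends with a unit clause forcing the final parity to 0. Representing literals as signed integers (in the style of DIMACS files) and the function names are assumptions of this sketch, not details taken from the text.

```python
from itertools import product


def xor_to_cnf(xs, next_aux):
    """Encode x_1 ^ x_2 ^ ... ^ x_j = 0 as CNF clauses (lists of signed ints).

    xs       : positive variable indices taking part in the XOR constraint
    next_aux : first unused variable index, used for the auxiliary y variables
    Returns (clauses, next unused variable index).
    """
    if not xs:
        return [], next_aux              # an empty XOR is trivially 0
    clauses = []
    acc = xs[0]                          # running parity of the chain
    for x in xs[1:]:
        y = next_aux
        next_aux += 1
        # y <=> acc XOR x, expanded into four clauses
        clauses += [[-y, acc, x], [-y, -acc, -x], [y, -acc, x], [y, acc, -x]]
        acc = y
    clauses.append([-acc])               # (y XOR 1): the final parity must be 0
    return clauses, next_aux


def satisfiable_with(clauses, fixed, aux_vars):
    """Brute-force check: can the auxiliary variables be set to satisfy all clauses?"""
    for aux_bits in product((False, True), repeat=len(aux_vars)):
        value = dict(fixed)
        value.update(zip(aux_vars, aux_bits))
        if all(any(value[abs(lit)] == (lit > 0) for lit in c) for c in clauses):
            return True
    return False
```

For a constraint over three variables, `xor_to_cnf([1, 2, 3], 4)` introduces two auxiliary variables and nine clauses; an assignment to the original variables can be extended to a satisfying assignment exactly when an even number of them is 1.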
Example 1 Consider the CNF formula (a + b + c′)(b′ + d)(a′ + d′)(a + c + d), where
the solutions {abcd} are {0001, 0101, 0111, 1000, 1010}. The number of solutions
can be reduced by generating an XOR clause corresponding to the randomly generated
$w_1$: a = 0, b = 1, c = 1, d = 0. The resulting CNF would be (a + b + c′)(b′ + d)(a′ +
d′)(a + c + d)(y ⇔ b ⊕ c)(y ⊕ 1), where only 3 solutions {0001, 0111, 1000} remain.
If $S_F$ represents the set of all solutions of F, then the addition of k random $w_k$ vectors,
or equivalently of k random XOR constraints, reduces the size of the solution space to
$\sim 2^{-k}|S_F|$.
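Example 1 is small enough to check by exhaustive enumeration. The Python sketch below lists the solutions of the four-clause formula and then keeps only those whose dot product with $w_1$ is 0 in base-2 arithmetic. The variable numbering (a = 1, b = 2, c = 3, d = 4) and the helper names are choices made for this illustration.

```python
from itertools import product

# (a + b + c')(b' + d)(a' + d')(a + c + d) with a=1, b=2, c=3, d=4
EXAMPLE_CNF = [[1, 2, -3], [-2, 4], [-1, -4], [1, 3, 4]]


def solutions(clauses, num_vars):
    """Enumerate satisfying assignments by brute force (tiny instances only)."""
    found = []
    for bits in product((0, 1), repeat=num_vars):
        if all(any(bits[abs(lit) - 1] == (lit > 0) for lit in c) for c in clauses):
            found.append(bits)
    return found


def apply_xor(sols, w):
    """Keep the solutions v with v . w = 0 (mod 2), i.e. one XOR constraint."""
    return [v for v in sols if sum(vi & wi for vi, wi in zip(v, w)) % 2 == 0]
```

Calling `apply_xor(solutions(EXAMPLE_CNF, 4), (0, 1, 1, 0))` reduces the five solutions to the three that assign equal values to b and c, in line with the expected reduction by roughly half.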
Random simulation with SAT. Through XOR-based reductions to a unique-SAT (U-SAT)
instance (an instance with one solution), any particular solution¹ can be generated, which
is the basis for our approach for deriving an even distribution of simulation vectors. Based
on the results in [81], we can estimate that adding n XOR constraints for a CNF with n
variables produces a U-SAT instance. (Since this is an estimate, some instances may have
no solutions, i.e., are over-constrained, and some have multiple solutions. Instances with
no solutions are called UNSAT.) Therefore, we can add multiple sets of n different XOR
constraints to derive U-SAT instances where the unique solutions are evenly distributed
following from the randomness of the reduction. In a circuit application where we wish
to generate random input patterns, the XOR constraints need only involve primary input
signals since the different ways to stimulate a circuit are completely determined by the
assignments to the primary inputs of the circuit. Consequently, if an entire circuit is mapped
to a CNF, the XORs added will not involve internal signals and therefore they will typically
only increase the size of the original instance by a small amount. In principle, any
SAT solver can be used to derive solutions for this modified SAT instance.
¹ The constraints derived from Equation 4.7 are satisfied when an even number of variables in each
constraint is assigned 1, which always permits the all-0s solution.
While our approach does not always produce instances with a unique solution, this
happens very frequently [81]. In our strategy, if an unsatisfiable instance is produced, we
derive another one. If an instance has multiple solutions, the SAT solver selects one of the
remaining solutions. Using a SAT solver, we can derive an even distribution of simulation
vectors as we show empirically in Section 4.4. However, if one desires only a small
number of input vectors, a more efficient procedure can be used that requires the addition
of fewer constraints and minimizes the number of unsatisfiable instances produced. For
example, consider the case where only 64 evenly distributed input vectors are desired for
circuit C with n primary inputs where $2^n > 64$. In this case, 6 XOR constraints can be
added to approximately reduce the solution space to 1/64 of the original size. By adding
different random sets of 6 XOR constraints 64 times, we can still achieve an even
distribution of solutions for the number of solution vectors desired, with faster simulation
runtimes as shown in Section 4.4. In general, if we seek K simulation vectors, we solve K
SAT instances each with different sets of $\log_2(K)$ XOR constraints.
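The procedure of solving K instances with $\log_2(K)$ random XOR constraints each can be sketched in Python on a toy scale. Several points are assumptions of this sketch rather than details from the text: a fixed-order exhaustive scan stands in for the SAT solver, the legal input space is given by a hypothetical predicate `is_legal`, and each XOR constraint carries a random right-hand-side bit (Equation 4.6 fixes it to 0) so that the scan-order stand-in does not keep returning the all-0s vector.

```python
import math
import random
from itertools import product


def first_solution(n, is_legal, xors):
    """Stand-in for a SAT call: scan assignments in a fixed order and return the
    first legal one that meets every XOR constraint, or None if there is none."""
    for v in product((0, 1), repeat=n):
        if is_legal(v) and all(
            sum(a & b for a, b in zip(v, w)) % 2 == parity for w, parity in xors
        ):
            return v
    return None


def sample_vectors(n, is_legal, k, rng, max_tries=10_000):
    """Derive up to k input vectors, each from an instance with ceil(log2 k) XORs."""
    num_xor = max(1, math.ceil(math.log2(k)))
    vectors = []
    for _ in range(max_tries):
        if len(vectors) == k:
            break
        xors = [
            (tuple(rng.randint(0, 1) for _ in range(n)), rng.randint(0, 1))
            for _ in range(num_xor)
        ]
        v = first_solution(n, is_legal, xors)
        if v is not None:  # an UNSAT instance is discarded and a new one is drawn
            vectors.append(v)
    return vectors
```

A call such as `sample_vectors(8, lambda v: v[0] == 1, 16, random.Random(1))` uses 4 XOR constraints per instance and returns vectors that all respect the input constraint, even though the underlying scan by itself would always return the same vector.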
The addition of designer-specified constraints for targeting design properties to the
original SAT instance does not affect the XOR formulation previously described. There-
fore, an even distribution of input vectors can be derived that satisfies these additional
constraints. Consider a circuit C with $|S_C|$ solutions and a circuit constrained with
additional designer-specified constraints $C^*$ that has $|S_{C^*}|$ solutions. When $|S_{C^*}| \ll |S_C|$,
solutions that exist in $S_C$, $S_{C_i}$, may rarely exist in $S_{C^*}$ as illustrated in Figure 4.3a. By
adding $\log_2(K)$ XOR constraints, we can derive K vectors $S_{C^*_i}$ that are evenly distributed.
If numerous UNSAT instances occur, implying that $K > |S_{C^*}|$, then one can alternatively
exhaustively enumerate all the solutions.
Figure 4.3: a) XOR constraints are added to reduce the solution space of a SAT instance
$C^*$, which is sparser than the solution space of C. b) Component A is targeted
for simulation, so that its m inputs are evenly sensitized within circuit C.
4.3.2 Partition-Targeted Simulation
We now propose an approach to automatically stimulate internal partitions while sat-
isfying input constraints.
Stimulating a component within a design. Evenly stimulating an inadequately sensitized
component by choosing certain input vectors is not straightforward, because the
relationship between the distribution of stimuli on the primary inputs and on the inputs of the
component is often complex. In Figure 4.3b, we show a component A with m input signals
that we desire to stimulate and that is deeply buried in the design hierarchy. We denote
the solution space of A with respect to the input constraints as $S_{AC}$, and denote $S_A$ as the
solution space when not considering the input constraints. By applying random vectors
to the m signals and checking whether the input constraints are satisfied, we can evenly
stimulate A. However, limited controllability could mean that the input constraints are
rarely satisfied, leading to prohibitive runtimes. To this end, we propose a new SAT-based
methodology that expands upon our circuit simulation strategy in Section 4.3.1. For circuit
C and its subcircuit A, we observe the following relation between CNF formulae:
$$\mathit{CNF}(C) = \mathit{CNF}(C \setminus A) \wedge \mathit{CNF}(A) \qquad (4.8)$$
Therefore a solution to C implies a solution to A. Since the m signals uniquely determine
every legal input combination to A, we can reduce the solution space $S_A$ and
subsequently $S_{AC}$ by adding XOR constraints involving the variables m:
$$\mathit{CNF}(C \setminus A) \wedge \mathit{CNF}(A) \wedge (m_{i_1} \oplus m_{i_2} \oplus \cdots \oplus m_{i_j} \oplus 1) \qquad (4.9)$$
This formulation reduces $S_A$ roughly in half, and since the input constraints are accounted
for by the constraint $\mathit{CNF}(C \setminus A)$, subsequently reduces $S_{AC}$ in half. Although many $S_{C_i}$
may map to one $S_{A_i}$, the intention of this formulation is to generate input vectors that
evenly sensitize the component, not the entire circuit.
Algorithm. Function partition_sim(), shown in Figure 4.4, generates an even
distribution of simulation vectors by adding multiple random XOR constraints according to
Equation 4.9. The number of random XOR constraints added is determined by the number
of simulation vectors (num_sims) desired. After constructing the CNF, designer-specified
constraints can be added (add_additional_constrs()). Then, we add different sets
of XOR constraints for each pass of the while loop by function add_xor_constrs().
When large XOR constraints are added, the increased cost of propagating implications on
a large set of clauses can be mitigated by (easily) adding specialized data structures and
decision procedures. However, our experiments indicate that small XOR constraints are
most common in our application and these usually do not slow down the SAT solver
appreciably. Therefore, specialized solver extensions for XORs, as in [35], are unnecessary.
partition_sim(Partition part, Circuit C, int num_sims) {
    ...
    add_xor_constrs(num_xor, part, CNF);
    ...
}
Figure 4.4: Partition simulation algorithm.
If the instance is satisfiable, we add the new simulation vector and decrement num_sims.
If the solution space considered is small relative to the number of simulation vectors
applied, the SAT solver frequently derives the same simulation vector again. This can be
eliminated by adding blocking clauses to the SAT instance. Also, numerous unsatisfiable
instances are produced when the number of desired simulation vectors is similar to the size
of the solution space. We can avoid these scenarios by estimating the size of the solution
space as described below.
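The overall loop described for partition_sim() can be sketched in Python on a toy circuit. This is an illustration under stated assumptions, not the algorithm of Figure 4.4 itself: an exhaustive scan over primary inputs replaces the SAT solver, the partition cut is modeled by a hypothetical function `cut_fn` from primary inputs to the m cut signals, designer-specified constraints are modeled by a hypothetical predicate `is_legal`, and blocking clauses are modeled by a set of already-derived cut vectors.

```python
import math
import random
from itertools import product


def partition_sim(cut_fn, num_inputs, num_cut, num_sims,
                  is_legal=lambda x: True, seed=0, max_tries=10_000):
    """Derive up to num_sims primary-input vectors that spread over the cut."""
    if num_sims <= 0:
        return []
    rng = random.Random(seed)
    num_xor = max(1, math.ceil(math.log2(num_sims)))  # XORs per instance
    sims, blocked = [], set()
    for _ in range(max_tries):
        if len(sims) == num_sims:
            break
        # A fresh random set of XOR constraints over the cut signals (Eq. 4.9).
        ws = [tuple(rng.randint(0, 1) for _ in range(num_cut))
              for _ in range(num_xor)]
        for x in product((0, 1), repeat=num_inputs):  # stand-in for the SAT call
            if not is_legal(x):
                continue
            cut = cut_fn(x)
            if cut in blocked:                        # models a blocking clause
                continue
            if all(sum(c & w for c, w in zip(cut, wv)) % 2 == 0 for wv in ws):
                sims.append(x)                        # new simulation vector
                blocked.add(cut)
                break
        # If no vector was found, the instance was UNSAT and a new one is drawn.
    return sims


def toy_cut(x):
    """Hypothetical 6-input logic driving a 3-signal partition cut."""
    return (x[0] & x[1], x[2] & x[3], x[4] ^ x[5])
```

With `partition_sim(toy_cut, 6, 3, 4)`, each instance carries 2 XOR constraints over the three cut signals, and the four returned input vectors drive four different value combinations on the cut.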
Controllability estimation with XORs. To maximize the effectiveness of our SAT-based
simulation, we seek to target poorly-sensitized regions where the number of possible
vectors is also greater than the ones observed so far, so that our SAT-based simulation has the
potential to generate many new vectors. Ensuring this requires estimating the number of
solutions for partition A with respect to its input constraints, $|S_{AC}|$.
We now show how to estimate whether $|S_{AC}| > (1 + \Delta) \cdot \mathit{num\_diff\_vecs}$ using XOR
constraints.² In other words, we use XORs to find whether less than half of the possible
vector combinations have already been observed along the partition’s inputs. To do this,
we use the result from [37] to estimate the number of SAT solutions with random XOR
constraints. For instance, if the addition of x different XOR constraints does not produce
an UNSAT result, we can estimate that the solution space is of size $\geq 2^x$. By examining
multiple sets of different XOR constraints, we can obtain bounds with high accuracy,
as proved in [37]. Since we desire a lower-bound computation and need XOR constraints
that only involve the partition inputs, we can improve the efficiency of [37] for our specific
circuit application.
² We consider ∆ = 1 in this work.
4.4 Empirical Validation
We show that adding XOR constraints can evenly stimulate a design and that Toggle
can improve activity for poorly stimulated partitions, while being considerably more
efficient than a guided random simulation approach. In our experimental evaluation, we use
MiniSAT [29] to derive simulation vectors and hMetis [46] to perform circuit partitioning.
We examine circuits from the IWLS 2005 suite [102] and consider only their
combinational portions.
Efficiently generating random stimuli with XOR constraints. In Table 4.1, we show
the results of performing our SAT-based simulation on the primary inputs of circuit alu4.
We report the entropy, the number of different simulation vectors (diff sim) generated,
and the runtime in seconds. For the results under SAT-based, we add 14 random XOR
constraints to generate U-SAT instances, until we derive #sim vectors (number of
simulation vectors). We compare this approach with random simulation and achieve
competitively high entropy. Since many of the reductions using 14 XOR constraints produce
UNSAT instances, this formulation is computationally expensive. Therefore, we show,
under approx SAT-based in Table 4.1, that by adding fewer XOR constraints,
determined by log2(#sim vectors), we can significantly improve the runtime of the previous
SAT-based formulation with nominal degradation to the entropy. Although random
simulation is sufficient for this simple example, we now show that even distributions of
simulation can be efficiently generated for internal signals while satisfying input constraints.
#sim              rand                      SAT-based                approx SAT-based
vectors   diff sim entropy time(s)   diff sim entropy time(s)   #xor diff sim entropy time(s)
  64         64     1.00     <1         63     0.99      2        6     58     0.97     <1
 128        128     1.00     <1        128     1.00      4        7    119     0.98      1
 256        253     1.00     <1        256     1.00      6        8    240     0.98      1
 512        499     0.99     <1        499     0.99     13        9    485     0.99      2
1024        991     0.99     <1        989     0.99     26       10    968     0.99      5
Table 4.1: Generating even stimuli through random XOR constraints for the 14 inputs of
alu4. We normalize the entropy seen along the inputs by log2(#sim vectors), so
that 1.0 is the highest entropy possible.
Identifying inactive parts of the circuit with Toggle. In Table 4.2, we show circuits that
are partitioned using the signal-entropy weighting objective using 1024 random simulation
vectors. After extensively experimenting with partitions of different sizes, we chose
partitions that are ∼100 gates in size. Compared to partitions of larger or smaller size,
we observed empirically that this partition size most effectively balances our desire for
examining the coverage of large parts of the circuit while minimizing the number of
signals considered for entropy analysis and re-simulation. Our results are averaged over 5
independent runs.
We show the average and worst entropy, where 10.0 is the maximum entropy possible
with 1024 random input vectors. The results indicate that, while the average entropy for
each circuit is close to 10.0, there is usually at least one partition that is considerably
lower, as in tv80. We can then perform simulation mainly over these few poorly covered
partitions.
circuit      #gates  average  worst         guide+32                rand+32
                     entropy  entropy  new comb  %entr incr   new comb  %entr incr
spi            3010     9.5     6.4     +26.2       2.67        +1.6       0.12
systemcdes     3196     9.1     5.6     +15.0       0.48       +14.0       0.19
tv80           6847     8.9     1.6     +18.6       5.17        +0.8      -0.17
systemcaes     7453     9.7     5.2     +26.6       1.01       +12.4       0.16
ac97_ctrl     10284    10.0     9.5     +24.8       0.43       +21.8       0.35
usb_funct     11889     9.9     7.4     +26.4       1.10       +12.2       0.24
aes_core      20277     7.5     4.1     +17.0       2.60        +4.6       0.14
wb_conmax     28409     8.8     6.2     +25.2       2.12        +3.6       0.26
ethernet      37634     9.9     1.6     +26.2       2.12        +1.4       0.22
des_perf      94002     9.1     5.0     +13.4       0.55        +5.0       0.17
Table 4.2: Entropy analysis on partitioned circuits, the number of new input combinations
found and the percentage of entropy increase after adding 32 guided input
vectors versus 32 random ones.
Improving activity with Toggle. In the next part of Table 4.2, we assess the improvement
of our SAT-based targeted re-simulation on a partition with low entropy and a sufficiently
large solution space by deriving 32 additional simulation vectors. Our guided simulation
is compared to generating 32 more random vectors. In new comb, we report the number
of new combinations seen at the partition inputs averaged over 5 independent runs with
different random seeds, and in %entr incr we report the percentage increase in entropy
for the partition. Our approach outperforms random simulation on almost every circuit.
Random simulation performs poorly on some circuits, e.g., ethernet and tv80, indicating
strong bias under random simulation. If no improvements to verification coverage for a
partition are possible with random simulation, the percentage increase in entropy hovers
around 0. Our approach can still re-derive some previously seen vectors, but we minimize
these occurrences by our estimation of the partition’s solution space size, which prevents
re-simulation on partitions with limited controllability. Even for ac97_ctrl, which is evenly
sensitized by random simulation, we see some improvements because the worst-case entropy
for the partition targeted for re-simulation is not at the maximum value of 10.0.
circuit      guided+32   part. random+32   entropy
              time(s)        time(s)       time(s)
spi              1             210            <1
systemcdes       1             110            <1
tv80             1           time-out         <1
systemcaes       1             110             1
ac97_ctrl        1               2            <1
usb_funct        1              18            <1
aes_core         2           time-out          1
wb_conmax        4             232             1
ethernet        10             107             2
des_perf        20              23             2
Table 4.3: Comparing SAT-based re-simulation with random re-simulation over a partition
for generating 32 vectors. The time-out is 10000 seconds.
Runtime efficiency of Toggle. In Table 4.3, we show that evenly simulating a partition
by randomly assigning values to its inputs and checking whether the primary input con-
straints are satisfied, is often much slower than using SAT-guided simulation. The results
indicate that the SAT-based simulation scales well for larger circuits, in part, because the
size of the XOR constraints required is typically small compared to the size of the circuit.
Also, our SAT-based simulation often achieves orders of magnitude runtime improvement
over random simulation, such as for wb_conmax and ethernet. For some benchmarks,
such as tv80 and aes_core, random re-simulation even times out at 10,000 seconds. These
results indicate that the solution space of the partition stimulated is sparse with respect to
the input constraints. We expect our technique to perform even better when additional
designer-specified constraints are added, since this would further reduce the size of the
solution space. For completeness, the last column shows the runtime of the entropy
calculation in Equation 4.4. Clearly, this calculation is fast and scales to large designs.
4.5 Concluding Remarks
Our framework shows that certain theoretical results, not used in verification and sim-
ulation previously, hold the potential to significantly improve simulation coverage. This
is done through careful feedback on coverage and biasing of input vectors to better
stimulate poorly-sensitized parts of the circuit. By improving the quality of the simulation,
we can expose interesting corner-case behavior in the circuit and encode it in signatures.
To achieve these goals, we have introduced 1) an entropy metric to characterize the veri-
fication coverage of internal signals and 2) a novel simulation framework that uses XOR
constraints to generate even distributions of stimuli while satisfying complex constraints.
Our coverage metric reveals circuit regions that are inadequately stimulated under random
simulation. We also show that adding only a few XOR constraints is often sufficient to
evenly sensitize a design. Finally, our results indicate that guided simulation can com-




Enhancing Simulation-based Abstractions with Don’t Cares
In this chapter, we introduce a strategy to efficiently derive and encode don’t-care
values in a bit signature using logic simulation. Using don’t-cares facilitates more powerful
synthesis transformations as shown in Chapter VIII.
Computing don’t-cares for a node is challenging because a node’s don’t-care set is
determined with respect to the primary inputs and outputs, which we will refer to as global
don’t-care analysis. Table 5.1 compares our analysis and its capabilities in computing
don’t-cares with previous work. One common theme among previous approaches is that
they typically do not consider the entire fanin and fanout cone of a node because of the
high cost of computation; we will refer to these solutions as local don’t-care analysis.
In [34, 59], a solution is proposed that can exploit satisfiability don’t-cares (SDCs), but
not observability don’t-cares (ODCs) by relying on a combination of SAT solving and
simulation in equivalence checking. In [61, 95], ODCs are considered, but only for a few
levels of logic in the node’s fanout cone.
In our approach, we can handle large circuits and derive all SDCs and ODCs with
respect to the input vectors used in producing logic signatures. To this end, we develop a
novel approximate simulator whose performance scales linearly with the size of the circuit.
We evaluate the accuracy of our simulator both analytically and through empirical results.
Property        Simulation-guided   Window-based    Local               Our solution
                SAT [34, 59]        ODC+SDC [61]    SAT-sweep [95]
Don’t-cares     global SDCs         local SDCs      global SDCs         global SDCs
computed                            local ODCs      local ODCs          global ODCs
Computational   simulation + SAT    primarily SAT   simulation + SAT    simulation + SAT
engines
Complexity      SAT engine          windowing       levels of           moving-dominator
limited by                          strategy        downstream logic    incremental SAT
                                                                        (Chapter VI)
Primary         verification        synthesis       verification        verification; logic &
application                                                             physical synthesis
domain
Table 5.1: Comparisons between related techniques to expose circuit don’t-cares. Our
solution can efficiently derive both global SDCs and ODCs.
5.1 Encoding Don’t Cares in Signatures
When using signatures, there is no need to identify SDCs explicitly because impossible
input combinations are not generated during logic simulation. However, some of the bits
in signatures do not affect the outputs of the circuit and therefore they represent ODCs. To
account for ODCs, we maintain an ODC mask $S^*_f$ for node f in addition to its signature
$S_f$.
Definition 5.1.1 For input vectors $X_i$, $S^*_f = \{X_1 \notin ODC(f), \ldots, X_K \notin ODC(f)\}$ denotes the
ODC mask for function f. ODC(f) is the set of input vectors for which node f has an
observability don’t-care.
When an input vector $X_i$ is in the set ODC(f), the corresponding bit position is denoted
by a 0.
Figure 5.1: Example of our ODC representation for a small circuit. For clarity, we only
show ODC information for node c (not shown is the downstream logic
determining those don’t-cares). For the other internal nodes, we report only their
signature S. When examining the first four simulation patterns, node b is a
candidate for merging with node c up to ODCs. Further simulation indicates
that an ODC-enabled merger is not possible.
Figure 5.1 shows a circuit with signatures for each node and, in addition, a mask for
node c. Each ODC for a node is marked by a 0 in the ODC mask. We express the logic
flexibility of a given node by maintaining an upper-bound signature $S^{hi}$ and a lower-bound
signature $S^{lo}$: $S^{hi}_f = S_f \mid \neg S^*_f$, where | represents bit-wise OR, and $S^{lo}_f = S_f \,\&\, S^*_f$, where
& represents bit-wise AND.
$S^{lo}_f$ and $S^{hi}_f$ of node f correspond to the range of Boolean functions $[f^{lo}, f^{hi}]$ that can
implement f without modifying the circuit’s functionality because the logical difference
between any pair of functions within $[f^{lo}, f^{hi}]$ is a subset of ODC(f).
After simulation generates the signatures, potential optimizations can be identified. In
the example in Figure 5.1, after the first four simulation patterns, node b is identified as a
candidate for implementing c, meaning that we have a potential node merger. However,
in this example, further simulation reveals that the candidate merger is not viable because
b and c are not compatible with respect to their last signature bit. In Chapter VIII, we
present a node-merging application that efficiently exploits this don’t-care encoding.
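When a signature is stored as a K-bit integer, the upper and lower bounds and the candidate check reduce to a few bitwise operations. The following is a minimal sketch rather than the dissertation’s implementation: the function names (`bounds`, `is_candidate`), the width K = 8, and the sample bit patterns are illustrative assumptions and do not reproduce the values of Figure 5.1.

```python
from typing import Tuple

K = 8                    # assumed number of simulation vectors (bits per signature)
FULL = (1 << K) - 1      # K-bit all-ones word


def bounds(sig: int, mask: int) -> Tuple[int, int]:
    """Return (S_lo, S_hi) for a node with signature `sig` and ODC mask `mask`.

    A 0 bit in `mask` marks a simulation vector on which the node is a don't-care.
    """
    s_lo = sig & mask                 # don't-care bits forced to 0
    s_hi = sig | (~mask & FULL)       # don't-care bits forced to 1
    return s_lo, s_hi


def is_candidate(sig_g: int, sig_f: int, mask_f: int) -> bool:
    """True if S_lo_f <= S_g <= S_hi_f bit by bit, i.e. g agrees with f on all care bits."""
    s_lo, s_hi = bounds(sig_f, mask_f)
    below_lo = s_lo & ~sig_g & FULL   # bits that must be 1 but are 0 in g
    above_hi = sig_g & ~s_hi & FULL   # bits that must be 0 but are 1 in g
    return below_lo == 0 and above_hi == 0


# Hypothetical node f whose bits 1 and 2 are don't-cares.
SIG_F, MASK_F = 0b10100110, 0b11111001
print(is_candidate(0b10100010, SIG_F, MASK_F))   # differs from f only on a don't-care bit
print(is_candidate(0b00100110, SIG_F, MASK_F))   # differs from f on a care bit
```

The interval test is equivalent to checking that `(sig_f ^ sig_g) & mask_f` is zero. A candidate found this way is only a suggestion, since later simulation vectors or a formal check can still reject it.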
5.2 Global ODC Analysis
Below we describe a simulator with linear runtime complexity that finds ODCs for
each node of a circuit. Generating ODC masks $S^*_f$ efficiently is integral to maintaining
the scalability of our signature-based framework. While each node’s signature can be
computed from its immediate fanin, computing each node’s ODC mask often requires
analyzing its entire fanout cone.
The mask $S^*$ can be computed for each node by using Equation 2.1,¹ where the $X_i$ are
the random simulation vectors. This approach requires circuit simulation of each $X_i$ for
each node of the circuit. For K simulation vectors and n internal nodes, the time complexity
is $O(n^2 K)$.² Although the simulation can be confined to just the fanout cone of the node,
this approach is computationally expensive.
5.2.1 Approximate ODC Simulator
To improve upon the baseline algorithm described above, we developed an approximate
ODC simulator whose complexity is only O(nK), where n is the number of nets in the
design and K is the number of simulation vectors. Our approach computes the ODCs
of one node at a time in a manner that reuses previous computation. An outline of the
algorithm for generating the masks in our approximate simulator is shown in Figure 5.2.
The function set_output_S∗() initializes the masks of nodes directly connected to
the input of a latch or to a primary output to all 1s. The nodes are then ordered and traversed
in reverse topological order as generated by reverse_levelize(). The immediate
fanout of each node is then examined. The function get_local_ODC() performs ODC
analysis for every simulation vector for node, as defined by Equation 2.2,³ except that only
the subcircuit defined by node and output is considered. This local ODC mask is
bitwise-ANDed with output’s S∗ and is subsequently ORed with node’s S∗.

    set_output_S∗(N); reverse_levelize(N);
    for each node ∈ N {
        node.S∗ = 0;
        for each output ∈ node.fanout {
            temp_S∗ = get_local_ODC(node, output);
            node.S∗ = node.S∗ | (temp_S∗ & output.S∗);
        }
    }

Figure 5.2: Efficiently generating ODC masks for each node.

¹ $C(a) = \bigcup_{i \,:\, D(X_i) \neq D^*(X_i)} X_i$.
The algorithm requires only a traversal of all the nets given by the two for each
loops and the K simulation vectors considered for each net in get_local_ODC(),
resulting in the O(nK) complexity. This algorithm enables our global ODC simulator to be
more efficient than what can be achieved by simply extending the local observability
calculations in [95] to perform global ODC analysis.
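The mask propagation described above can be sketched on a toy netlist. This is a minimal illustration under stated assumptions, not the author’s code: the netlist encoding, the helper names (`eval_gate`, `simulate`, `approx_odc_masks`), the restriction to AND/OR/NOT gates, and the sample circuit y = (a AND b) OR c are all hypothetical. Here the local ODC mask is obtained by complementing the node’s signature at one fanout gate and comparing that gate’s output.

```python
from typing import Dict, List, Set, Tuple

K = 4                          # simulation vectors per signature (assumed)
FULL = (1 << K) - 1            # K-bit all-ones word
Gate = Tuple[str, List[str]]   # (operator, fanin names)


def eval_gate(op: str, ins: List[int]) -> int:
    """Evaluate one gate bit-parallel over all K vectors (NOT, AND, otherwise OR)."""
    if op == "NOT":
        return ~ins[0] & FULL
    acc = ins[0]
    for x in ins[1:]:
        acc = (acc & x) if op == "AND" else (acc | x)
    return acc


def simulate(inputs: Dict[str, int], gates: Dict[str, Gate]) -> Dict[str, int]:
    """Compute signatures; `gates` must be listed in topological order."""
    sigs = dict(inputs)
    for name, (op, ins) in gates.items():
        sigs[name] = eval_gate(op, [sigs[i] for i in ins])
    return sigs


def approx_odc_masks(gates: Dict[str, Gate], sigs: Dict[str, int],
                     outputs: Set[str]) -> Dict[str, int]:
    """One pass in reverse topological order; 0 bits in a mask mark don't-cares."""
    fanout: Dict[str, List[str]] = {n: [] for n in sigs}
    for name, (_, ins) in gates.items():
        for i in ins:
            fanout[i].append(name)
    masks: Dict[str, int] = {}
    for node in reversed(list(sigs)):            # fanouts are visited before fanins
        if node in outputs:
            masks[node] = FULL                   # outputs are always observable
            continue
        mask = 0
        for out in fanout[node]:
            op, ins = gates[out]
            flipped = [(~sigs[i] & FULL) if i == node else sigs[i] for i in ins]
            local = sigs[out] ^ eval_gate(op, flipped)   # 1 = flip is visible at `out`
            mask |= local & masks[out]           # AND with fanout mask, OR into node
        masks[node] = mask
    return masks


INPUTS = {"a": 0b1100, "b": 0b1010, "c": 0b0110}
GATES = {"x": ("AND", ["a", "b"]), "y": ("OR", ["x", "c"])}
SIGS = simulate(INPUTS, GATES)
MASKS = approx_odc_masks(GATES, SIGS, outputs={"y"})
```

The toy circuit has no reconvergent fanout, so the approximate masks coincide with the exact ones; for example, `a` is observable only on vectors where `b` is 1 and `c` is 0.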
We can apply our algorithm to the circuit in Figure 5.1 to compute the ODCs of
node a from the ODC information shown for node c. Because node c has don’t-cares
for the second and third simulation bits, node a also has don’t-cares for those bits. When





node b has a controlling value of 0.
5.2.2 False Positives and False Negatives
Since we do not consider logic interactions that occur because of reconvergence, it is
possible for the algorithm in Figure 5.2 to incorrectly produce 0s (false positives) or 1s
(false negatives) in $S^*$. For the example shown in Figure 5.3, node a misses a don’t-care
(false negative) in the third bit of $S^*_a$. Notice that nodes b and c do not have any ODCs
and no local ODCs exist between a and b or a and c, resulting in no ODCs being detected by
the approximate simulator. However, the reconvergence of downstream logic makes the
third value of node a a don’t-care. In a similar manner, false positives may occur due to
the interaction of multiple signals with local ODCs at a reconvergent node.
False positives do not affect the correctness of signature-guided transformations because
each transformation is formally verified by equivalence checking. However, false
negatives limit the pool of potential optimizations available for resynthesis. We show
empirically in this chapter and in Chapter VIII that false negatives and positives occur rarely
and seldom affect the results produced.
Figure 5.3: Example of a false negative generated by our approximate ODC simulator due
to reconvergence. S∗ and S are shown for all internal nodes; only S is shown
for the primary inputs and outputs.
5.2.3 Analysis and Approximation of ODCs
We have observed that most ODCs require only a few levels of downstream logic to be
computed. Indeed, consider two nodes in a circuit, f and g, where f is in the fanin cone
of g. If X denotes the set of inputs to g, the number of $X_i$ in the ODC set of f where g is
the output, $ODC_g(f)$, is given by the following:
$$|ODC_g(f)| = \mathrm{nZeroes}\big(g(f=0,X) \oplus g(f=1,X)\big) \qquad (5.1)$$
where g is expressed as a composition of X and f.
Assuming a uniform input distribution, the probability that $X_i \in ODC_g(f)$ is equal to
$\frac{|ODC_g(f)|}{2^{|X|}}$. In other words, this gives the probability that the output of f
is a don’t-care for input vector $X_i$. We offer a more insightful analysis by considering a
subset of Boolean functions that have simple disjunctive decompositions [8], which are
defined as:
Definition 5.2.1 Function g(X) (input set X) has a disjunctive decomposition if g can be
expressed as $g^*(h(Y),Z)$ where $X = Y \cup Z$ and $Y \cap Z = \emptyset$.
In practice, many functions in a circuit can be expressed as disjunctive decompositions
[11, 70].
We assume that g has the disjunctive decomposition $g^*(f(Y),Z)$⁴ and note the following
theorem based on Equation 5.1:







Proof. $g^*(f(Y),Z)$ can be expressed as a $(|Z|+1)$-input function $g^*(w,Z)$ where w = f(Y).
The $|Z|$-input functions $g^*_w$ and $g^*_{w'}$ correspond to $g^*(w=1,Z)$ and $g^*(w=0,Z)$,
respectively.









(independent of w). Since Y is independent of Z, there are $2^{|Y|}$ different sets of these
$Z_i$ combinations where the value of w does not affect the output of g. □
For the disjunctive decomposition $g^*(f(Y),Z)$, the probability that $X_i \in ODC_g(f)$ can be
expressed as:








Notice that this expression is independent of the function implemented by f.
We can now develop a lower bound on the probability that f has an ODC for a given
input vector $X_i$, as indicated by the following theorem:






w) gives the number of input vectors where w is independent from
g (a don’t-care with respect to g). The difference in the number of minterms and maxterms
in g gives a lower bound to the number of input vectors where w is independent of g. □
Note that the entropy of $g^*(w,Z)$ corresponds to a lower bound of the probability of
ODCs. Functions with low entropy, i.e., high information loss, have a high percentage of
input vectors with ODCs.
Example 1. Consider f as a primary input and g(X) as an |X|-input AND gate, which
has 1 minterm and $2^{|X|}-1$ maxterms. $\frac{2^{|X|}-2}{2^{|X|}}$, given by Theorem 2, is
the lower bound of $P_{X_i \in ODC_g(f)}$. In this case, the lower bound is also the exact
probability, as given by Corollary 1. For a two-input AND, the probability is 1/2, and for a
five-input AND the probability is 15/16. If g is implemented with a set of two-input AND
gates where f is at the first logic level, we see that the first few logic levels account for
most of f’s ODCs. □
In this example, we observe that most ODCs are due to only a few levels of logic.
This trend is made clearer by considering the ODCs of f with respect to other nodes
when f has more than one fanout. Consider $ODC_{g_1}(f)$ and $ODC_{g_2}(f)$. If we assume
that $g_1(f(Y),A)$ and $g_2(f(Y),B)$ are disjunctive decompositions and that
$A \cap B = \emptyset$, we can express the probability of an ODC for f relative to outputs
$g_1$ and $g_2$ by the following theorem:
Theorem 3 $P_{X_i \in ODC_{g_1 g_2}(f)} = P_{X_i \in ODC_{g_1}(f)} \, P_{X_i \in ODC_{g_2}(f)}$.
Proof. According to Corollary 1, $P_{X_i \in ODC_g(f)}$ is independent of the implementation
of f. The probability that an ODC exists for input vector $X_i$ is the joint probability that
there is an ODC with respect to both $g_1$ and $g_2$. Since A and B are independent, the two
relative probabilities are independent, which results in the above relation. □
Example 2. If $g_1, g_2, \ldots, g_m$ are n-input ANDs that are fanouts of primary input f,
$P_{X_i \in ODC_{g_1 g_2 \ldots g_m}(f)} = \left(\frac{2^n-2}{2^n}\right)^m$. As in the previous
example, the probability is 1/2 for n = 2 and m = 1. However, the addition of just one
fanout significantly decreases this probability to 1/4, for n = 2 and m = 2. □
In this example, the presence of fanout counteracts the mechanism shown earlier for
producing don’t-cares. When a circuit contains many nodes with fanout, which is often
the case, the ODCs of a node are often due to the impact of only a few levels of fanout
logic.
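The probabilities quoted in Examples 1 and 2 can be checked by exhaustive enumeration, applying the nZeroes test of Equation 5.1 directly: f is unobservable on an input vector when setting f to 0 or to 1 leaves every gate output unchanged. The sketch below is only an illustration; the function name `odc_probability` and its parameters are assumptions, and the m AND gates are taken to share f while their remaining inputs are disjoint.

```python
from fractions import Fraction
from itertools import product


def odc_probability(n: int, m: int = 1) -> Fraction:
    """Probability that f is a don't-care at all m of its n-input AND fanouts.

    The result does not depend on the value of f, so only the other
    m * (n - 1) inputs are enumerated.
    """
    side = n - 1                                          # inputs per gate besides f
    dont_care = 0
    for others in product((0, 1), repeat=m * side):
        groups = [others[j * side:(j + 1) * side] for j in range(m)]
        outs_f0 = [int(all((0,) + g)) for g in groups]    # gate outputs with f = 0
        outs_f1 = [int(all((1,) + g)) for g in groups]    # gate outputs with f = 1
        if outs_f0 == outs_f1:                            # flipping f changes no output
            dont_care += 1
    return Fraction(dont_care, 2 ** (m * side))


print(odc_probability(2))       # two-input AND, single fanout
print(odc_probability(5))       # five-input AND, single fanout
print(odc_probability(2, 2))    # two two-input ANDs sharing f
```

Using `Fraction` keeps the results exact, so they can be compared directly with the closed form $\left(\frac{2^n-2}{2^n}\right)^m$.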
Our approximate simulator can be inaccurate for circuits with reconvergent paths because
our per-node computation treats the fanout cones of the immediate outputs as being
disjoint from each other. We now show why our approximate simulator rarely produces
false positives and negatives.⁵ To understand the impact of reconvergent paths on the
accuracy of our simulator, consider the function $G(g_1, g_2)$, where $g_1(f(Y),A)$ and
$g_2(f(Y),B)$ are as before. G represents a reconvergent node. We note the following:
Theorem 4 The probability that approximate simulation produces an error in f’s ODC
set is $P_{error} \leq (1 - P_{X_i \in ODC_{g_1}(f)})(1 - P_{X_i \in ODC_{g_2}(f)})$.
Proof. If $X_i \in ODC_{g_1}(f)$ and $X_i \in ODC_{g_2}(f)$, then f is not observable at the inputs
of G. If $X_i \in ODC_{g_1}(f)$ or $X_i \in ODC_{g_2}(f)$, then the inputs to G can be accurately
analyzed independently since only one of the inputs experiences an observable difference.
An error in the approximate simulator can only occur when neither $X_i \in ODC_{g_1}(f)$ nor
$X_i \in ODC_{g_2}(f)$. □
According to Example 1, there is a high probability that f is unobservable with respect
to a single node after a few levels of logic for certain commonly used functions with low
entropy, such as ANDs and ORs. When there are few reconvergent nodes relative
to the total number of nodes in the circuit (we have observed this empirically), the
upper bound for $P_{error}$ becomes very small. We also showed in Example 2 the impact of
multiple outputs on the observability of node f. There may be nodes other than $g_1$ and
$g_2$ that fan out from f and increase f’s observability independent of the reconvergence,
reducing the probability of error in the approximate analysis.
5.2.4 Performance of Approximate Simulator
In Table 5.2, we report empirical data on the runtime efficiency of our approximate
ODC simulator. The first column indicates the benchmarks examined. The second column,
sim, gives the time required to generate only the signature S for each node. We use this as
a baseline to assess the cost of generating masks. The third column, sim_odc, shows the
time required to generate S∗ for each node using Equation 2.2. The fourth column, our
approx, shows the time to compute S∗ using the approximate simulator. The results
indicate that the approximate simulator’s runtime is comparable to that of sim and is much
faster than sim_odc. These results were generated by running 2048 random simulation
vectors.

⁵ In Section 5.2.2, we explain that false positives and negatives can be tolerated in our framework.

circuit                 runtime (s)
               sim    sim_odc    our approx
ac97ctrl         1          6             1
aescore          2         79             1
desperf          9        410             7
ethernet         4         76             2
mem_ctrl         1        119             1
pci_bdge32       1         28             1
spi              0         39             0
systemcaes       1         48             1
systemcdes       0         24             0
tv80             1        130             1
usb_funct        1         11             1
wb_conmax        3         69             4
Table 5.2: Efficiency of the approximate ODC simulator.

circuit        considering x downstream levels [95]      our global
                 2      4      8      16       32        algorithm (s)
ac97ctrl       1.0    1.0    1.0     1.0      1.0          1.0
aescore        3.0    3.1    3.4     6.3      7.9          3.0
spi            0.4    0.5    0.5     1.8     11.2          0.4
systemcaes     2.3    2.4    2.6    11.9   1300.0          2.3
systemcdes     0.3    0.3    0.3     0.5      0.6          0.3
tv80           2.2    2.3    2.6     8.2    363.0          2.2
usb_funct      2.2    2.3    2.4     2.8      3.3          2.2
Table 5.3: Runtime comparison between techniques from [95] and our global simulation.
In Table 5.3, we compare our simulator, which considers ODCs by examining all
downstream logic, with the implementation in [95], where a local don’t-care analysis is
performed per node considering only a few levels of downstream logic. We show runtimes for
[95] as a function of the number of downstream levels considered. Notice that our simulator
accounts for more don’t-cares while achieving better runtimes. In some circuits, like
systemcaes, considering more levels of logic is prohibitive using [95] due to the depth of
the circuit. The similarity between the runtime for only 2 levels and that of our
implementation suggests that the runtime contribution of our ODC simulation is insignificant
compared to the time needed to generate the initial signatures and parse the design.
Furthermore, these results show that computing don’t-cares using an $O(n^2)$-time algorithm
can become prohibitive. When ODC analysis needs to be performed repeatedly for each
circuit change, such as in reliability-guided synthesis [49], inefficient ODC computation can
become a significant computational bottleneck.
5.3 Concluding Remarks
In this chapter, we developed an efficient algorithm for computing don’t-cares using
functional simulation. Our strategy scales to large circuits and can compute global
don’t-cares, whereas previous work is limited to examining smaller windows of logic to compute
don’t-cares. By efficiently analyzing don’t-cares throughout the circuit, we can potentially
expose more optimizations, which is especially important late in the design flow where
fewer opportunities for improvement exist.
Part III
Improving the Efficiency of Formal
Equivalence Checking
The results of bit signature-based circuit analysis and transformations must be verified by
formal methods to ensure correctness for all input combinations. Generating high-quality
signatures using the techniques of the previous chapters increases the likelihood that
the transformations suggested by our abstraction are correct. However, even if the
abstraction is generally accurate in guiding optimizations, verification is still necessary and can
be prohibitively time-consuming, especially for larger designs. In this part of the
dissertation, we propose a solution to accelerate the verification of signature-based abstractions.
Chapter VI introduces a strategy to minimize the size of the logic block considered when
verifying an abstraction. In Chapter VII, we propose a parallel methodology for
general-purpose SAT solving that relies on increasingly prevalent multi-core systems as a means




Incremental Verification with Don’t Cares
In previous chapters, we have introduced techniques for improving the quality and
flexibility of bit signatures. These signatures can efficiently identify logic optimizations
because of their ability to distinguish nodes with a small number of input vectors. However,
if one desires to determine whether two nodes are equivalent when their corresponding
signatures are equal, a formal proof mechanism is needed to check for possible corner-case
behavior not captured by the given signatures. Therefore, refining the simulation [59] is an
important mechanism to limit the number of signatures that falsely suggest equivalence,
thus minimizing the number of expensive proofs.
Additionally, incorporating observability don’t-cares into signatures introduces new
challenges to both producing high-quality simulation and verifying the correctness of the
abstraction. In this chapter, we address these challenges by introducing an incremental
verification strategy that dynamically adjusts the complexity of the verification instance
based on the amount of downstream logic required to prove equivalence up to don’t-cares.
In Section 6.1, we outline and formalize some of the challenges involved in verifying
abstractions with don’t-cares. In Section 6.2, we introduce our incremental verification
methodology, and we provide concluding remarks in Section 6.3.
6.1 Verifying Signature Abstractions
Using a SAT solver to verify equivalence can be computationally expensive. However,
a high-quality selection of simulation vectors limits the number of false positives.¹ In
general, random simulation generates signatures capable of distinguishing two independent
random functions f and g with n inputs. In this case, the probability that the signatures
incorrectly indicate equivalence, $P_{error}$, is simply the joint probability that $S_f = S_g$ and
$f \neq g$. Under k input vectors this is:
$$P_{error} = P\big(S_f = S_g \,\cap\, f \neq g\big) = \frac{1}{2^k}\left(1 - \frac{1}{2^{2^n}}\right) \qquad (6.1)$$
where $P_{error}$ decreases exponentially as k increases. The term $\frac{1}{2^k}$ corresponds to
the probability that $S_f = S_g$ for k input vectors. The term $1 - \frac{1}{2^{2^n}}$ corresponds
to the probability that two n-input independent random functions are not equivalent (where
the number of n-input Boolean functions is $2^{2^n}$). For this case, a small number of random
simulation vectors is sufficient to distinguish nodes and avoid false positives.
Logic functions implemented by practical circuits exhibit structural properties and are
often dependent on one another. We can account for this in our analysis by defining the
DIFFSET between functions f and g:
$$DIFFSET(f,g) = \big(ONSET(f) \cap OFFSET(g)\big) \cup \big(OFFSET(f) \cap ONSET(g)\big)$$
where ONSET(f) is the set of minterms of f and OFFSET(f) is its set of maxterms.
Equivalently, $DIFFSET(f,g) = ONSET(f \oplus g)$.
¹ Here we use the term false positive to refer to incorrect equivalent nodes suggested because of a
signature match.
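For k random input vectors, the probability of error is then
$$P_{error} = \left(1 - \frac{|DIFFSET(f,g)|}{2^n}\right)^k$$
In other words, this is the probability that $S_f$ and $S_g$ are equal.
$\frac{|DIFFSET(f,g)|}{2^n}$ is the fraction of input combinations where f and g are different.
As the number of input vectors k increases, $P_{error}$ decreases.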
For functions encountered in practice, |DIFFSET(f,g)| is often fairly large, indicating
that random simulation would rarely produce signatures leading to false positives. For
instance, OR functions have $2^n - 1$ minterms, AND functions have 1 minterm, and XOR
functions have $\frac{2^n}{2}$. These common associative functions can often be distinguished
from each other quickly by simulation because they exhibit significant differences.
The NOR function has only 1 minterm, as does the AND function; therefore, a large number
of input vectors k is needed to achieve a low $P_{error}$ when comparing the signatures of
AND and NOR. To reduce the size of k needed to distinguish nodes, simulation refinement
[47, 59] is commonly performed through SAT-generated counterexamples. In simulation
refinement, a miter is constructed between two nodes with matching signatures, and a SAT
solver attempts to satisfy its output. If a solution is found, the solution vector (dynamic
simulation vector) is applied to the circuit so that the signature of each node in the circuit
increases to size k + 1. Not only does this new vector distinguish the two nodes, but it
typically also improves the quality of the signatures in the nodes’ fanin and fanout cones.
The impact of don’t-cares. When ODCs in the circuit are taken into account, more
input vectors are usually required to achieve the same $P_{error}$ between f and g. Given
ODC(f), we wish to check whether g can implement f in the circuit. To do this, we check
whether $S_f \,\&\, S^*_f = S_g \,\&\, S^*_f$ ($S^*_f$ is the ODC mask of f). In other words, we
check if the signatures of the two nodes match after masking the don’t-care bits of f. The
impact on $P_{error}$ is given by:
$$P_{error} = \left(1 - \frac{|DIFFSET(f,g) - ODC(f)|}{2^n}\right)^k \qquad (6.4)$$
where the elements in DIFFSET(f,g) that are in ODC(f) are removed. As a result,
$\frac{|DIFFSET(f,g) - ODC(f)|}{2^n}$ is the fraction of input combinations where f and g have
observable differences. In some cases, internal nodes in the circuit are not easily
controllable, and hence a large k is needed to limit $P_{error}$.
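The effect of DIFFSET size on signature quality can be made concrete with a small enumeration. The sketch below assumes that the k simulation vectors are drawn uniformly and independently, and that the error probability has the form $\left(1 - \frac{|DIFFSET(f,g) - ODC(f)|}{2^n}\right)^k$, consistent with Equation 6.4; the helper names `diffset` and `p_error`, and the choice n = 4, are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product
from typing import Callable, FrozenSet, List, Tuple

Vec = Tuple[int, ...]
Func = Callable[[Vec], int]


def diffset(f: Func, g: Func, n: int) -> List[Vec]:
    """All n-bit input vectors on which f and g differ, i.e. ONSET(f XOR g)."""
    return [x for x in product((0, 1), repeat=n) if f(x) != g(x)]


def p_error(f: Func, g: Func, n: int, k: int,
            odc: FrozenSet[Vec] = frozenset()) -> Fraction:
    """Probability that k random vectors all miss the observable differences."""
    observable = [x for x in diffset(f, g, n) if x not in odc]
    return (1 - Fraction(len(observable), 2 ** n)) ** k


def AND(x: Vec) -> int:
    return int(all(x))


def OR(x: Vec) -> int:
    return int(any(x))


def NOR(x: Vec) -> int:
    return int(not any(x))


print(float(p_error(AND, OR, 4, 8)))    # large DIFFSET: a match is very unlikely
print(float(p_error(AND, NOR, 4, 8)))   # DIFFSET of size 2: a match is likely
```

AND and NOR differ only on the all-zeros and all-ones vectors, so random simulation separates them slowly. Marking one of those two vectors as an ODC of f shrinks the observable difference further and raises the error probability.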
Limitations of previous approaches. The equivalence of two nodes, f and g, in a
network can be determined by constructing a miter [13] between them and asserting the






∀X  F(X) ⊕ G(X) ≠ 1 )    (6.5)
where X is an input vector.
Since exploiting ODCs entails including downstream logic, verifying ODC-based mergers
could require a miter on the primary outputs of the circuit. Figure 2.4 shows how ODCs
can be identified for a given node in a network. In a similar manner, we can prove whether
the signatures of b and a match up to ODCs. Instead of using a′ in the modified circuit $D^*$,
b is substituted for a and miters are constructed at the outputs. If the care-set determined
by Equation 2.1 is null,² b matches a. A single satisfiable solution is needed to expose
a difference between a and b. Notice that this approach requires the entire circuit to be
considered, resulting in large SAT instances.

² $C(a) = \bigcup_{i \,:\, D(X_i) \neq D^*(X_i)} X_i$.
6.2 Incremental Equivalence Checking up to Don’t Cares
To improve the quality of equivalency checkers, we propose an incremental verification
framework where the size of the SAT instance is dynamically adjusted between each SAT
solver call. We only consider the smallest required logic block to determine equivalence.
Furthermore, by reusing internal data structures between SAT calls, decision heuristics
used in SAT solving [65] can be refined. Many learnt clauses [80] can also be reused
between calls to prune the search space and boost the performance of the SAT solver.
Our incremental strategy has an important advantage — equivalence analyses that are not
critical can be aborted if their verification takes too much time. In other words, we can
use the runtime cost of verification as a factor in determining whether verifying a match is
worthwhile.
6.2.1 Moving-dominator Equivalence Checker
We introduce here a SAT framework that determines equivalence in the presence of
don’t-cares by considering only a small portion of downstream logic. Consider Figure 6.1,
where g is a candidate node to be merged with f up to don’t-cares. If a miter is constructed
across f and g instead of the primary outputs, as shown in part a), the set of differences
between f and g that results in satisfying assignments is given by DIFFSET(f,g). (A
satisfying solution here indicates the non-equivalence of the given circuit nodes.) If one
of these differences between the two nodes is observable at the primary outputs (by
examining the downstream logic of f), then non-equivalence that considers ODCs is proven. If
none of these differences is observable or if the DIFFSET is null, then g can be merged
with f.
However, if f and g have a large DIFFSET, this could lead to a prohibitive amount
of simulation since each difference in DIFFSET is propagated from node f to the circuit’s
outputs. To reduce the size of DIFFSET, we construct miters farther from the potential
merger site at node f while minimizing the amount of downstream logic considered in the
mitered circuit. We introduce the notion of a dominator set to define where we place the
miters.
Definition 6.2.1 The dominator set for node f is a set of nodes in the circuit such that
every path from node f to a primary output contains a member in the dominator set and
where, for each dominator member, there exists at least one path from node f to a primary
output that contains only that member. Multiple distinct dominator sets can exist for a
given node.
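The two conditions of Definition 6.2.1 can be checked with plain graph reachability. The sketch below is an illustration only: the adjacency-list encoding of the fanout graph, the function names `reachable` and `is_dominator_set`, and the sample graph are assumptions and are not taken from the dissertation.

```python
from typing import Dict, Iterable, List, Set


def reachable(graph: Dict[str, List[str]], start: str, blocked: Set[str]) -> Set[str]:
    """Nodes reachable from `start` along fanout edges without entering `blocked`."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        v = stack.pop()
        if v in seen or v in blocked:
            continue
        seen.add(v)
        stack.extend(graph.get(v, []))
    return seen


def is_dominator_set(graph: Dict[str, List[str]], f: str,
                     outputs: Iterable[str], dom: Iterable[str]) -> bool:
    outs, members = set(outputs), set(dom)
    # Condition 1: every path from f to a primary output crosses some member.
    if reachable(graph, f, members) & outs:
        return False
    # Condition 2: each member lies on an f-to-output path that avoids all other members.
    for d in members:
        others = members - {d}
        if d not in reachable(graph, f, others):
            return False
        if not (reachable(graph, d, others) & outs):
            return False
    return True


# Hypothetical fanout graph: f drives u and v; u drives o1 and w; v drives w; w drives o2.
GRAPH = {"f": ["u", "v"], "u": ["o1", "w"], "v": ["w"], "w": ["o2"]}
OUTPUTS = {"o1", "o2"}
print(is_dominator_set(GRAPH, "f", OUTPUTS, {"u", "v"}))
print(is_dominator_set(GRAPH, "f", OUTPUTS, {"u", "w"}))
```

Both sets in the usage lines satisfy the definition for the same node, which illustrates the remark that multiple distinct dominator sets can exist.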
6.2.2 Verification Algorithm
In part b) of Figure 6.1, we show miters constructed for a dominator set of f. Dominator
sets close to the source node f result in simpler SAT instances but potentially require
more downstream simulation to check whether the satisfying assignments indeed prove
the non-equivalence of f and g. We devise a strategy that dynamically moves the dominator
set closer to the primary outputs depending on the satisfying assignments generated. Our
“moving-dominator” algorithm is outlined in Figure 6.2.
The moving-dominator algorithm starts by deriving a dominator set that is close to the
merger site, given by calculate_initial_dominator(). Then the dom_SAT()
function solves an instance where miters are placed across the current dominator set.
Figure 6.1: An example that shows how to prove that node g can implement node f in the
circuit. a) A miter is constructed between f and g to check for equivalence,
but it does not account for ODCs because the logic in the fanout cone of f is
not considered. b) A dominator set can be formed in the fanout cone of f and
miters can be placed across the dominators to account for ODCs.
An UNSAT solution implies that the two candidate nodes are indeed equivalent, and
the procedure exits. If a satisfying solution is found, it is propagated through the downstream
logic from the current dominator set. If the input vector corresponding to the satisfying
assignment does not result in an ODC at f, then node g cannot implement f. Otherwise,
the procedure must be refined: a new dominator set is generated as determined by
calculate_new_dominator(), which moves the miters closer to the outputs.
With each invocation of the SAT solver, we add constraints that are particular to
the current dominator set, as well as increase the size of the SAT instance to account
for the additional downstream logic considered. When the dominator set is adjusted by
calculate_new_dominator(), some of the constraints needed for the previous dominator
set are no longer relevant; we remove these constraints and add new ones to the
SAT instance. By incrementally building the SAT instance each time the dominator set is
moved, we can reuse information learned by a SAT solver between several SAT calls.

    bool odc_match(f, g) {
        current_dom = calculate_initial_dominator();
        ...

Figure 6.2: Determining whether two nodes are equivalent up to ODCs.
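The control flow described above can be sketched as a small refinement loop. This is a hedged skeleton and not a reconstruction of Figure 6.2: the three callbacks (`dom_sat`, `is_observable`, `next_dom`) are hypothetical stand-ins for the SAT call on the mitered dominator set, the downstream simulation of a counterexample to the primary outputs, and the dominator update, and no real SAT solver is invoked.

```python
from typing import Callable, Optional, Sequence, TypeVar

Dom = TypeVar("Dom")          # whatever represents a dominator set
Vector = Sequence[int]        # a counterexample input vector


def odc_match(initial_dom: Dom,
              dom_sat: Callable[[Dom], Optional[Vector]],
              is_observable: Callable[[Vector], bool],
              next_dom: Callable[[Dom, Vector], Dom]) -> bool:
    """Return True if the merger is proven, False if a real difference is found.

    dom_sat(dom)       -> None when the mitered instance is UNSAT, else a counterexample
    is_observable(cex) -> True when the difference reaches a primary output
    next_dom(dom, cex) -> a dominator set closer to the primary outputs
    Termination relies on next_dom eventually reaching the primary outputs.
    """
    current = initial_dom
    while True:
        cex = dom_sat(current)
        if cex is None:               # UNSAT: no difference at the dominators
            return True
        if is_observable(cex):        # difference is not an ODC at f
            return False
        current = next_dom(current, cex)   # refine: move the miters downstream
```

Passing the callbacks as parameters keeps the loop independent of any particular solver, and it makes it straightforward to add a runtime budget that aborts non-critical checks.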
ATPG techniques can also be substituted for the SAT engine described in the previous
algorithm. By placing a MUX with a dangling select input between the two nodes in the
potential merger, we can generate test patterns for single-stuck-at faults (SSF) on the MUX
select input. If a test pattern cannot be generated, the merger can take place because both
nodes have the same effect on the outputs. Similarly, the circuit considered can be limited
by the dominator set, and a test pattern counterexample can be used to refine it.
6.2.3 Calculating Dominators
Using simulation, we calculate a dominator set that attempts to minimize the amount of
downstream logic necessary to prove a merger. In general, we check the downstream logic
required to prove specific ODCs for certain input combinations and use that to determine
an initial dominator set. We then use counterexamples produced by the SAT solver to
refine the dominator set. Details of this approach are outlined below.
In Figure 2.4, ODC(f) is derived by examining observability at the primary outputs.
However, by placing miters along a cut defined between f and the primary outputs, it is
possible to calculate an ODC-set for f, $ODC_{cut}(f)$, where $ODC_{cut}(f) \subseteq ODC(f)$.
Previously, we defined this cut as the dominator set. An ideal dominator set would be the
closest cut to the merger site sufficient to prove equivalence. We define the minimal dominator
set as follows:
Definition 6.2.2 The minimal dominator set $D_{min}$ for proving that g can implement f is
the closest cut to f such that $DIFFSET(f,g) \subseteq ODC_{D_{min}}(f)$.
The function calculate_initial_dominator() is used to calculate an initial
dominator set. We randomly select several input vectors $X_i$ and generate an approximate
$D_{min}$ using Definition 6.2.2 by constructing DIFFSET(f,g) and ODC(f) from the $X_i$.
Since not all input vectors are considered, it is possible that the cut obtained is an
under-approximation and that the SAT solver fails to detect equivalence. To improve the
approximation, calculate_new_dominator() extends the cut farther from f for every
satisfying assignment found by dom_SAT().
6.3 Concluding Remarks
We introduced an incremental verification methodology to reduce the complexity of
SAT instances when verifying our signature abstractions. Since many ODCs occur within
few logic levels from the focus circuit node [95], ODC analysis through even a small
number of logic levels can bring significant runtime improvements. Our dynamic approach
finds the smallest logic window to verify a node merger that requires ODCs and produces
counterexamples to refine signatures accordingly. In later chapters, this incremental
verification algorithm is used to verify optimizations in the presence of don’t-cares, where it



As shown in the last chapter, our bit signature-based transformations rely on SAT-based
equivalence checking for validation, occasionally requiring the solution of very complex
instances. We observe that SAT computation can be a runtime bottleneck in our
signature-based synthesis framework. In this chapter, we propose a novel parallel SAT strategy to
exploit increasingly prevalent multi-core architectures, which feature a large shared
memory and have the ability to execute several threads simultaneously. Multi-threaded SAT
solving can be used to reduce the runtime of verifying signature-guided optimizations, so
that more powerful optimizations become practical. We discuss the theoretical
underpinnings of our approach to SAT parallelization and how it improves upon previous parallel
SAT strategies.
7.1 Parallel-processing Methodologies in EDA
“Intrinsically parallel” tasks, such as multimedia processing, may achieve an N-times
speed-up by using N cores (assuming that sufficient memory bandwidth is available and
that cache coherency is not a bottleneck). However, combinatorial optimization and search
problems, such as SAT solving and integer linear programming, are much harder to
parallelize. The straightforward solution, processing different branches of a given decision
in parallel, often fails miserably in practice because such branches are not independent in
leading-edge solvers that rely on branch-and-backtrack. The recent “View from Berkeley”
project [7] designates these problems as one of thirteen core computational categories for
which parallel algorithms must be developed. In this chapter, we propose new techniques
to parallelize state-of-the-art SAT solving.
Figure 7.1: High-level flow of our concurrent SAT methodology. We introduce a scheduler
for completing a batch of SAT instances of varying complexity and a
lightweight parallel strategy for handling the most complex instances.
As of 2008, most EDA frameworks are being rapidly extended to make use of multi-core
architectures, i.e., to run several cooperating threads in parallel. In particular,
state-of-the-art techniques for design optimization such as SAT sweeping [69, 95], SAT-based
technology mapping for FPGAs [52] and, in our case, logic resynthesis require solving
multiple SAT instances. In the case of many key EDA algorithms, the computation of these
SAT solutions constitutes their bottleneck, and solving them in parallel offers a chance to
speed up a broad range of EDA tools. Shared-memory systems and multi-core CPUs are
particularly amenable to such parallelization strategies.
We first introduce a novel framework for scheduling and solving multiple instances of
hard SAT problems on shared-memory systems such as multiprocessors, as illustrated in
Figure 7.1. Different client applications (such as formal verification) produce several SAT
instances, which are issued to our concurrent SAT solver. These instances are put into a
priority scheduler, so that easier instances are finished first, while harder ones are solved
using an XOR partitioning strategy.
The first problem we address is that of scheduling ofM SAT instances onN processors
whenM > N. Take, for example, the caseN = 1. If runtimes are known for each instance in
advance, then scheduling instances of increasing runtime guarantees the bestbatch latency,
i.e., as the sum of completion times of all instances from the beginning of the batch. In
other words, a long-running job does not delay numerous small jobs. Scheduling forN
processors, and withouta priori runtime information, is more involved, and our work
is the first to address this problem. Furthermore, many applications generate individual
SAT instances rather than batches — the technique we proposehandles this case as well.
The need to parallelize individual SAT instances arises prima ly when no other instances
remain to be solved to keep all available cores busy. In our framework, we parallelize the
hardest SAT instances after they have run sequentially for some time.
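The N = 1 case described above can be checked with a short sketch. The runtimes are hypothetical values chosen for illustration, and `batch_latency` is a helper name introduced here, not part of the original framework.

```python
from itertools import permutations

def batch_latency(runtimes):
    """Sum of completion times when jobs run back-to-back on one processor."""
    clock = 0
    total = 0
    for r in runtimes:
        clock += r          # this job completes at the current clock value
        total += clock
    return total

runtimes = [7, 1, 4, 2]                      # hypothetical runtimes (minutes)
best_order = min(permutations(runtimes), key=batch_latency)
sjf_order = tuple(sorted(runtimes))          # increasing-runtime schedule

print(batch_latency(runtimes), batch_latency(sjf_order), best_order == sjf_order)
```

The exhaustive search over all orderings agrees with the increasing-runtime order for this small input, which is the property the text relies on when runtimes are known in advance.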
In this chapter, we achieve two performance goals: 1) minimization of the average
latency for solving a group of SAT problems while ensuring maximum resource utilization
and 2) minimization of runtime for large problem instances by exploiting concurrent
resources. To reduce the average latency of a collection of SAT instances, we introduce a
novel scheduling algorithm that combines the benefits of time-slicing and batch schedul-
ing. We achieve a 20% average latency improvement over previous techniques while im-
proving resource utilization. To reduce the runtime for single large instances, we consider
a novel partitioning scheme based on including additional constraints to an instance to re-
duce the size of the search space. We exploit a theoretical result from [81] on randomized
polynomial-time algorithms, where adding a limited number of random XOR constraints
to a SAT instance can reduce it from one with multiple solutions to one with a single
solution. We are the first to apply this result to search-space partitioning in multi-core
SAT solving, circumventing a major pitfall common to parallel SAT solver algorithms,
i.e., unbalanced partitioning [89]. We further observe that search-space partitioning is best
performed when the random restart frequency is low, a common problem when the initial
part of the search is conducted sequentially. We validate our parallel methodology by per-
forming extensive experiments on an eight-core system and improve resource utilization
by 60.5% over prior work based on solver portfolios.
In Section 7.2, we analyze the issues that are at the core of the high variability of
execution for SAT solvers. Section 7.3 introduces our scheduling algorithm for handling
multiple SAT instances of varying complexity in a parallel setting. We discuss the limitations
of previously proposed parallel solutions in Section 7.4. In Section 7.5, we present
a partitioning strategy that provides search-space division along with our strategy for load
balancing. We analyze the effectiveness of our approach in Section 7.6 and conclude in
Section 7.7.
7.2 Runtime Variability in SAT Solving
While DPLL SAT solvers typically struggle on randomly generated instances, most
practical SAT instances possess regular structure and can be solved much faster. However,
it has been observed that many practical instances experience exponential runtime variability
[39] when using backtrack-style SAT solvers even without algorithmic randomization.
This variability can be observed by comparing runtimes of different algorithms on a given
instance, and can be formalized through the notion of heavy-tail behavior, summarized
below.
Definition 7.2.1 For a random variable $X$, corresponding to the search cost for a particular
heuristic, a heavy-tail probability distribution exists if $\Pr[X > x] \propto x^{-\alpha}$ as $x \to \infty$ for
$0 < \alpha < 2$.
If the cumulative probability does not converge to 1 quickly enough, the distribution exhibits
a heavy tail. More specifically, the variance of $X$ is $\infty$, and when $\alpha < 1$ the mean
is also $\infty$. In performance analysis of a single SAT solving algorithm with randomization,
or multiple SAT algorithms, the random variable $X$ can capture the number of backtracks
required to solve a given instance. Also, since the maximum runtime is exponential, the
bounded heavy tail produces variance that is actually exponential in the number of
backtracks.
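As a worked illustration of why the moments diverge, consider the simplest distribution with such a tail. The exact Pareto form below is an assumption made only for this calculation; Definition 7.2.1 requires only the asymptotic proportionality.

```latex
% Assumption for illustration: exact Pareto tail, \Pr[X > x] = x^{-\alpha} for x \ge 1
% (and \Pr[X > x] = 1 for x < 1).
\begin{align*}
E[X]   &= \int_0^\infty \Pr[X > x]\,dx = 1 + \int_1^\infty x^{-\alpha}\,dx
        = \begin{cases} \dfrac{\alpha}{\alpha - 1} & \alpha > 1 \\[1ex] \infty & \alpha \le 1 \end{cases} \\
E[X^2] &= \int_0^\infty 2x\,\Pr[X > x]\,dx = 1 + 2\int_1^\infty x^{1-\alpha}\,dx
        = \begin{cases} \dfrac{\alpha}{\alpha - 2} & \alpha > 2 \\[1ex] \infty & \alpha \le 2 \end{cases}
\end{align*}
```

Under this assumption the second moment, and hence the variance, is infinite throughout the range $0 < \alpha < 2$, and the mean is infinite as well once $\alpha$ drops to 1 or below.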
Random restarting (see Section 2.1.1), which is now extensively used in DPLL-based
solvers and involves a worst-case polynomial number of restarts, can eliminate heavy-tail
behavior [39]. Intuitively, random restarts prevent a solver from getting stuck in a difficult
part of the search space. Portfolio strategies [38] offer similar benefits because each
heuristic tends to explore different parts of the search space. Furthermore, each heuristic
can utilize multiple restarting strategies, which in turn can produce more improvement.
Backdoor variables. We now discuss the impact of backdoors on the performance of
branch-and-backtrack types of SAT solvers.
Definition 7.2.2 Backdoor [85] variables for a SAT instance are a set of variables that
under some assignment produces a sub-problem solvable in polynomial time.
For example, a backdoor may yield a residual SAT instance that can be solved by a
linear-time 2-SAT algorithm.
Definition 7.2.3 Given a Boolean formula $F(V)$ and a set of variables $B \subseteq V$, $B$ is a
backdoor if $\exists A_B\,[F_{A_B} \in P \wedge F_{A_B} \neq 0]$, where $A_B \in \{0,1\}^{|B|}$ is an assignment to the set of
variables $B$.
In [85], it was observed that many common problems contain a small backdoor set.
Definition 7.2.4 Given a Boolean formula $F(V)$, a partial variable assignment $B$ is a
strong backdoor if $\forall A_B\,[F_{A_B} \in P]$.
There are $2^{|B|}$ combinations that need to be examined to solve an unsatisfiable instance
for a total runtime of $2^{|B|} P(F_{A_B})$, where $P(F_{A_B})$ is the runtime of the polynomial algorithm
under a given assignment. Empirical evaluation in [85] suggests that many practical
problems have $|B| \propto \log(|V|)$ resulting in total runtime of $|V| P(F_{A_B})$ if the backdoor set
is known. Although determining this set is not always computationally feasible, decision
heuristics such as VSIDS implicitly look for such sets as they tend to favor variable assignments
that lead quickly to a full evaluation of an instance. It was also explained in [85] that
randomly generated instances tend to have considerably larger backdoors, approximately
30% of $|V|$. The efficient determination of a backdoor, explicitly or implicitly, is often key
to the performance of a branch-and-backtrack SAT solver.
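The enumeration over backdoor assignments can be sketched as follows. This is a toy illustration, not the solver used in this work: unit propagation stands in for the polynomial-time sub-solver, clauses use DIMACS-style signed integers, and the function names and the small formula are assumptions made for the example.

```python
from itertools import product

def unit_propagate(clauses, assignment):
    """Polynomial-time sub-solver: decide the residual formula by unit propagation."""
    assignment = dict(assignment)
    while True:
        changed, all_satisfied = False, True
        for clause in clauses:
            if any(assignment.get(abs(l)) == (l > 0) for l in clause):
                continue                          # clause already satisfied
            all_satisfied = False
            free = [l for l in clause if abs(l) not in assignment]
            if not free:
                return "UNSAT", None              # every literal is false
            if len(free) == 1:                    # unit clause forces its literal
                assignment[abs(free[0])] = free[0] > 0
                changed = True
        if all_satisfied:
            return "SAT", assignment
        if not changed:
            return "UNKNOWN", None                # propagation alone is not enough

def solve_with_backdoor(clauses, backdoor):
    """Try all 2^|B| backdoor assignments; return (model or None, assignments tried)."""
    tried = 0
    for values in product([False, True], repeat=len(backdoor)):
        tried += 1
        status, model = unit_propagate(clauses, dict(zip(backdoor, values)))
        if status == "SAT":
            return model, tried
        if status == "UNKNOWN":
            raise ValueError("not a strong backdoor for unit propagation")
    return None, tried

# Hypothetical instance: x3 == x1, x4 == x2, and exactly one of x3, x4 is true.
cnf = [[-1, 3], [1, -3], [-2, 4], [2, -4], [3, 4], [-3, -4]]
model, tried = solve_with_backdoor(cnf, backdoor=[1, 2])
print(model, tried)
```

For an unsatisfiable instance the loop visits all 2^|B| assignments before returning, which is the cost discussed above.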
7.3 Scheduling SAT Instances of Varying Difficulty
The goal of our scheduling strategy can be formally expressed as follows: given M
different SAT instances and an N-threaded machine, we wish to solve them in a way that







where $T_c(m)$ is the completion time for problem $m$, and $S_t$ is the number of instances being
solved in a particular time-slice $t$. Note, when N = 1, this formulation considers the case
of having a single thread of execution. Ideally, the completion time $T_c$ for the last instance
$m_f$ when using N > 1 threads, should be N times smaller than for N = 1 to fully utilize
the parallel resources.
Optimizing the objective above, subject to resource constraints, can lead to a schedule
that minimizes the total latency for completing all SAT instances. Assuming that incoming
instances are independent and equally important to solve, minimizing latency is a way to
ensure that feedback is provided to as many clients as possible in a timely manner. This
may unblock the largest number of clients waiting for results (see also Figure 7.1). In the
case where the runtimes for all instances are approximately equal, optimizing the latency
objective is trivial as the problems can be solved in any order. However, as shown in
Figure 7.2, a block of instances can experience a wide variance in runtime. In particular, by
analyzing the distribution of runtimes from the SAT 2003 competition [51], which contains
several benchmark suites, we observe a bipolar trend whereby most instances either finish
in the first five minutes or timeout after 64 minutes. An optimal schedule for anN-threaded
machine involves scheduling problems in increasing order of complexity on each thread.
Unfortunately, predicting actual runtimes beforehand is not possible. However, we will
discuss strategies for mitigating this limitation later in this section.
Figure 7.2: Number of SAT instances solved vs. time for the SAT 2003 collection. The
timeout is 64 minutes.
Because the distribution of runtimes is uneven, it is possible that random scheduling
could result in some threads completing execution much later than others, leading to poor
resource utilization. To even the execution latency across threads, we can leverage schedulers
available in most operating systems, which usually exploit time-slicing. Through
time-slicing, problems with short runtimes finish fairly quickly, while longer instances
tend to complete at approximately the same time.
Our solution relies on an estimate of the distribution of SAT runtimes to predict a time
threshold beyond which the unsolved problems are likely to have high complexity. We also
explore other techniques which are not dependent on predictive distributions to evaluate
possible overall better latency. From Figure 7.2, we see that this time threshold should
be approximately 5 minutes. Thus, for the first solving period, up to the threshold time,
we perform time-sliced scheduling over all the problems; after that, we increase the thread
priority for only N instances (where N is the number of threads available) so that they run
in batch mode.
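The threshold-based policy can be compared with pure batch and pure time-sliced scheduling in a small simulation. This is a sketch under simplifying assumptions: time-slicing is modeled as ideal fair sharing with no context-switch cost, leftover instances after the threshold are finished in arrival order, and the job runtimes, thread count, and threshold are hypothetical.

```python
import heapq

def time_sliced(work, n, stop=float("inf")):
    """Share n threads fairly among unfinished jobs until wall-clock time `stop`.

    work maps job id -> remaining CPU time. Returns (finish_times, leftover_work).
    """
    work, finish, clock = dict(work), {}, 0.0
    while work and clock < stop:
        rate = min(1.0, n / len(work))             # CPU share each job receives
        step = min(min(work.values()) / rate, stop - clock)
        clock += step
        for job in list(work):
            work[job] -= step * rate
            if work[job] <= 1e-9:                  # job completed at this event
                finish[job] = clock
                del work[job]
    return finish, work

def batch(work, n, clock=0.0):
    """Run jobs to completion in dictionary order, one job per free thread."""
    free_at, finish = [clock] * n, {}
    for job, remaining in work.items():
        start = heapq.heappop(free_at)
        finish[job] = start + remaining
        heapq.heappush(free_at, finish[job])
    return finish

def priority(work, n, threshold):
    """Time-slice until `threshold`, then finish the leftovers in batch mode."""
    finish, leftover = time_sliced(work, n, stop=threshold)
    finish.update(batch(leftover, n, clock=threshold))
    return finish

def average_latency(finish):
    return sum(finish.values()) / len(finish)

# Hypothetical runtimes (minutes): four hard instances queued ahead of two easy ones.
jobs = {"A": 60, "B": 60, "C": 60, "D": 60, "E": 1, "F": 1}
latency = {
    "batch": average_latency(batch(jobs, 2)),
    "time-slice": average_latency(time_sliced(jobs, 2)[0]),
    "priority": average_latency(priority(jobs, 2, threshold=4)),
}
print(latency)
```

On this input the easy instances finish early under time-slicing, and switching to batch mode after the threshold keeps the hard instances from all finishing at the very end, so the combined policy has the lowest average latency of the three.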
To further reduce the average latency, we can lower the priority for instances that
require large memory resources and thus negatively impact system performance. This
was unnecessary in our experimental evaluation since the instances we considered had
low memory profiles.
Figure 7.3: Percentage of total restarts for each minute of execution for a random sample
of instances from the SAT 2003 collection.
Although not implemented here, scheduling can be based on runtime estimates gener-
ated from progress meters found in some SAT solvers [4]. The thread priority for simpler
instances can be increased in this manner. For example, one could consider random restart
frequency, or the percentage of restarts performed each minute. In Figure 7.3, we show
a distribution of restarts over a randomly chosen sample of instances from the SAT 2003
collection. It reveals an exponential decay in frequency, which can be used as a guide to
lower thread priority. When few restarts occur, there are fewer opportunities to quickly
arrive at a solution due to a better variable order.
7.4 Current Parallel SAT Solvers
Previous efforts at parallelizing algorithms for solving random SAT instances have
been effective as indicated in [58], but random instances are not common in EDA applications,
whose problems exhibit structure. For such instances, [55] represents the state
of the art, proposing a solution that exploits shared memory to enable efficient learning
between solvers running on different threads. In this section, we overview some pitfalls of
this approach and discuss some limitations of portfolio solvers.
Search space partitioning using guiding paths, as proposed in [55], is limited because
the partitioning may be unbalanced. This may reduce the effectiveness of random
restarts by forcing initial assignments on each concurrent solver. Addressing this problem
by undoing the initial assignments for a thread after each random restart appears to undermine
the benefits of partitioning. The partition itself may also generate sub-problems that
demand very different runtimes. Furthermore, learning between threads is not always an
effective means of boosting performance. As discussed in [90], using 1-UIP learnt clauses
is often more effective at improving the solver’s performance than using minimally-sized
learnt clauses. This counter-intuitive result suggests that parallel schemes for learning,
which often use the size of learnt clauses as a filtering mechanism, are not an effective
mechanism for boosting the performance of a particular thread of execution.
Implementing these parallelization strategies requires careful selection of a successful
sequential solver. Choosing a poor heuristic for parallelization still leads to poor performance,
especially in a portfolio where it consistently under-performs compared to other
heuristics. Furthermore, the heuristics implemented in the most successful SAT solvers are
finely-tuned, which would require much careful and time-consuming development when
porting to parallel optimizations. The slightest perturbation to the quality of the sequential
algorithm caused by parallelization (such as excessive learning between threads) can
significantly degrade runtime performance. For example, learning increases the size of the
clause database which, in turn, increases the cost of Boolean constraint propagation. Furthermore,
decision heuristics, such as VSIDS, are guided by learning, and can therefore
be affected by it.
Portfolio solvers are advantageous because their implementation overhead is minimal
and they have a low risk of performing poorly on instances with highly variable runtime. However,
this approach requires that the various heuristics have different performance characteristics
on different types of instances. As larger computing systems become available,
it is increasingly difficult to find large collections of different heuristics. Furthermore,
even where orders-of-magnitude improvements are possible, some instances may show no
improvement, resulting in small overall speed-up.
7.5 Solving Individual Hard Instances in Parallel
In this section, we propose an algorithmic methodology that utilizes available resources
to reduce the runtime of hard instances. We overcome the limitations described previously
by introducing a novel approach for partitioning the search-space, which allows for more
flexible random restarts. Furthermore, our approach can be easily adopted by any state-of-
the-art DPLL-based solver.
7.5.1 Search Space Partitioning
Our technique for partitioning the search space of a SAT solver relies on the inclusion
of additional XOR constraints to the instance. In this section, we first elaborate on the
theoretical underpinnings of adding XOR constraints and then discuss its significance in
dividing a search space approximately evenly.
Reducing the search space through XOR constraints. To partition the search space,
we extend the work for solution-space reduction that was initially presented in [81]. The
authors of [81] show that the inclusion of an XOR constraint ($(x_1 \oplus x_2 \oplus \cdots \oplus x_i \oplus 1)$ as
shown in Equation 4.6) to an instance $F(V)$ probabilistically reduces its solution space by
approximately half. We call the instance obtained after adding this constraint $F_{even}$ because
the assignments to the $x_i$ variables must have even polarity to satisfy $F_{even}$. Correspondingly,
we call $F_{odd}$ the instance:
$F_{odd} = F \wedge (x_1 \oplus x_2 \oplus \cdots \oplus x_i \oplus 0)$    (7.2)
where the $x_i$ variables are the same as in $F_{even}$. $\{F_{even}, F_{odd}\}$ is then a disjoint partition of
the solution space. More formally:
Definition 7.5.1 A disjoint partition exists when (1) $F = F_{even} \vee F_{odd}$, (2) $F_{even} \wedge F_{odd} = 0$,
and (3) the set of variables $x_i \in V$ is the same for $F_{even}$ and $F_{odd}$.
This partitioning generates two sub-problems that can be assigned to different solvers.
The sub-problems can be recursively divided by adding more XOR constraints. As a generalization
of the result in [81], each XOR constraint probabilistically divides the number
of possible assignments of the $V$ variables roughly in half, i.e., to $2^{|V|-1}$. Hence, the
constraint divides the search space approximately in half, probabilistically balancing the
workload between different solvers addressing the two sub-problems.
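The disjoint-partition property can be checked by brute force on a toy instance. The formula and the choice of XOR variables below are hypothetical, and enumerating all assignments is only feasible at this scale; the sketch shows the split of the solution set, not a practical solver.

```python
from itertools import product

def models(clauses, num_vars):
    """All satisfying assignments of a CNF (DIMACS-style literals), by brute force."""
    result = []
    for bits in product([False, True], repeat=num_vars):
        if all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses):
            result.append(bits)
    return result

def parity(bits, xor_vars):
    """0 if an even number of the chosen variables are true, 1 otherwise."""
    return sum(bits[v - 1] for v in xor_vars) % 2

F = [[1, 2], [-1, 3], [2, 3, 4]]       # hypothetical 4-variable instance
xor_vars = [1, 2, 4]                   # variables appearing in the XOR constraint

all_models = models(F, 4)
even = [m for m in all_models if parity(m, xor_vars) == 0]   # solutions of F_even
odd = [m for m in all_models if parity(m, xor_vars) == 1]    # solutions of F_odd

print(len(all_models), len(even), len(odd))
```

Every solution of F lands in exactly one of the two halves, matching conditions (1) and (2) of Definition 7.5.1. The even split seen here is a property of this example; in general the balance is only probabilistic.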
In practice, simply adding large XOR constraints is inadequate for reducing the search
space, because no conflict is generated until all of the $x_i$ variables are assigned, approximately
$|V|/2$ variables.¹ In other words, such a constraint divides the search space evenly,
but it is ineffective at restricting the search until after nearly all assignments have been
made. To address this, we investigate smaller XOR constraints, derived from the original
complex ones, that still achieve the same theoretical result.
Connection between backdoors and randomized reductions. As an example of
how we can add smaller constraints, consider a combinational circuit D with m inputs (a
more general strategy is presented in Section 7.5.2). This circuit can be converted to a
SAT instance $D(V)$ with $V$ variables where the set of possible solutions determined by
assignments to the primary inputs is $M$, with $|M| = 2^m$. Therefore, the set of solutions
($S_D \in 2^{|V|}$) corresponds to the set of solutions $M$. In other words, any assignment $A_M \in 2^m$
results in precisely one solution. According to Definition 7.2.4, $M$ is a strong backdoor for
$D(V)$. By restricting the set of variables $x_i$ to variables in $M$, we can construct a partition
that gives the same probabilistic guarantees as the original formulation, but produces a
smaller XOR constraint while generating conflicts sooner if these variables are assigned
first. Namely, an XOR constraint on the variables in $M$ divides the solution space roughly
in half.
¹ In [81], each variable is randomly chosen to be in the XOR constraint with probability 1/2.
7.5.2 Lightweight Parallel SAT
For a general SAT instance, we can restrict XOR constraints to involve only the typically
small set of backdoor variables, where the XOR constraint can cut the search space
roughly in half to $2^{|B|-1}$.
Multi-threaded SAT framework. For a circuit, we showed that the primary inputs can
be used to derive small XOR constraints. In the following, we propose a more general approach
that approximately determines the backdoor set of variables to generate small XOR
constraints. Because computing the smallest backdoor set explicitly is not always feasible,
we use, as an approximation, highly ranked variables determined by selection heuristics
in modern DPLL-based solvers like VSIDS. Since [85] observed that many backdoor sets
have cardinality $\log_2(|V|)$, we choose $x_i$ from the top $\log_2(|V|)$ variables to generate small
XOR constraints. To generate variable rankings, we run a SAT solver for a certain amount
of time (determined experimentally) before generating these XOR constraints.
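A minimal sketch of this step follows, assuming that variable rankings are available as a dictionary of activity scores (a stand-in for VSIDS scores) and that the XOR constraint is expanded directly into CNF. The function names and the activity values are hypothetical.

```python
import math
from itertools import product

def top_vars(activity):
    """Pick roughly log2(|V|) of the highest-ranked variables."""
    k = max(1, round(math.log2(len(activity))))
    ranked = sorted(activity, key=activity.get, reverse=True)
    return sorted(ranked[:k])

def xor_to_cnf(variables, rhs):
    """Clauses forcing XOR(variables) == rhs: one clause per forbidden assignment."""
    clauses = []
    for bits in product([0, 1], repeat=len(variables)):
        if sum(bits) % 2 != rhs:
            # Rule out this assignment: some variable must differ from it.
            clauses.append([-v if b else v for v, b in zip(variables, bits)])
    return clauses

# Hypothetical activity scores for an 8-variable instance.
activity = {1: 0.5, 2: 3.0, 3: 0.1, 4: 2.2, 5: 0.0, 6: 1.7, 7: 0.3, 8: 0.9}
part_vars = top_vars(activity)
even_half = xor_to_cnf(part_vars, 0)     # clauses to append for one thread
odd_half = xor_to_cnf(part_vars, 1)      # clauses to append for the other
print(part_vars, len(even_half), len(odd_half))
```

A direct expansion of an XOR over k variables needs 2^(k-1) clauses, which is one reason keeping the constraint small matters if no specialized XOR representation is used.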
Algorithm. In Figure 7.4, we show the pseudo-code of our algorithm using XOR partitioning
to improve the performance of SAT in a parallel environment. psat_solve()
is a SAT solver invoked with the CNF instance (cnf), the number of random restarts after
which the problem should be partitioned (passes), the mode of execution (the default
mode is sequential, seq), and any initial variable assignments (assumps). This allows very
simple instances to be completed sequentially and trains the solver so that good variables
are chosen for partitioning. When partitioning is required, we add an XOR constraint
involving the top $\log(|V|)$ variables through add_xor_constraints(). Because the
XOR constraint is typically small, we don't require a specialized XOR constraint representation
as in [35]. We then spawn two threads and wait for their results. Notice that
the threaded mode uses the same infrastructure as the sequential mode with only a few
minor changes. To maintain an even division of work between the two threads, we ensure
that the partitioning variables part_vars are ranked high (we increase their rank after
restarting). Because multiple variables are used to drive the partitioning constraint, there is
more flexibility in the search procedure than having an exact guiding path. Finally, in the
DPLL_search() function, we share learnt clauses between threads when conflicts occur
to facilitate quick search-space pruning (this is similar to [55]). We expect our partitioning
to produce sub-problems with similar characteristics, thereby making our inter-thread
learning more powerful. If one thread finds its instance unsatisfiable, we do not repartition
the problem. We have observed that frequent repartitioning hinders the effectiveness
of the underlying sequential algorithm. In practice, we observe that the even partitioning
results in threads that compute for a similar amount of time.
Solution Space. It is possible to exploit the theoretical qualities of our partitioning and
note that the number of solutions to the SAT instance under study should be approximately
evenly distributed. Therefore, if one sub-problem is found to be unsatisfiable, we can
estimate that the other sub-problems have none or very few solutions. This could be used
to guide the selection of a portfolio of solvers on-the-fly.
Note that we do not partition a SAT instance until the batch-mode time threshold in
the methodology of Section 7.3 is reached. In addition, we may also partition an instance
when its restart frequency is low. This way, we reserve parallel computation only for
the hard problems and avoid deterministically partitioning the search space when variable
rankings change frequently. As a consequence, we can simplify our procedure in Figure
7.4 to not increase the ranking of variables chosen for the partitioning.
bool psat_solve(CNF cnf, int passes, Mode mod=seq, Lit assumps) {
  static Var part_vars;
  initialize_assumps(assumps);
  while (not done() && (passes-- || mod != seq)) {




  if (mod == seq && not done()) {
    part_vars = top_vars();
    add_xor_constraints(cnf, part_vars);
    thread(psat_solve, cnf, 0, parallel, neg);
    thread(psat_solve, cnf, 0, parallel, pos);
    while (wait) {
      if (SAT) return SAT;










if (top_level_conflict) return UNSAT;
backtrack();





Figure 7.4: Parallel SAT Algorithm.
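The partition-and-spawn step of Figure 7.4 can be mimicked with a toy sketch. It is not the pMiniSAT implementation: each worker is a brute-force search rather than a DPLL solver, no learnt clauses are shared, the sequential training phase is omitted, and the names `solve_half` and `satisfies` are introduced only for this example. Python threads are used to show the control flow, not to obtain a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from itertools import product

def satisfies(bits, clauses):
    """True if the assignment (tuple of booleans) satisfies every clause."""
    return all(any(bits[abs(l) - 1] == (l > 0) for l in c) for c in clauses)

def solve_half(clauses, num_vars, part_vars, parity):
    """Toy worker: search only assignments whose part_vars have the given parity."""
    for bits in product([False, True], repeat=num_vars):
        if sum(bits[v - 1] for v in part_vars) % 2 != parity:
            continue                      # excluded by this worker's XOR constraint
        if satisfies(bits, clauses):
            return bits
    return None

def psat_solve(clauses, num_vars, part_vars):
    """Return a model if either half is satisfiable, otherwise None (UNSAT)."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(solve_half, clauses, num_vars, part_vars, p)
                   for p in (0, 1)]
        for future in as_completed(futures):
            model = future.result()
            if model is not None:
                return model              # first satisfiable half wins
    return None                           # both halves are unsatisfiable

F = [[1, 2], [-1, 3], [2, 3, 4]]          # hypothetical instance
model = psat_solve(F, 4, part_vars=[1, 2, 4])
print(model)
```

Which half reports a model first depends on thread timing, so the returned assignment may differ between runs. The instance is unsatisfiable only when both halves return no model.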
7.6 Empirical Validation
To evaluate our new solver framework, we consider SAT 2003 Competition benchmarks
[51] from the handmade and industrial categories, both including several
suites. The runtime of each benchmark is profiled using MiniSAT 2 [29] on a four-processor
dual-core Opteron system clocked at 1 GHz with 16 GB of memory running
the Fedora 8 SMP OS. We set a timeout for each benchmark at 64 minutes and created a
distribution of runtimes over the entire suite. Our results indicate that most benchmarks
complete in either less than one minute or over one hour. This highlights the wide variance
in runtime performance motivating our proposed methodology. Statistics for the
benchmarks as well as runtime distributions can be found in Table 7.1 and Figure 7.2
respectively.
SAT suite    # benchmarks   # SAT   # UNSAT   # TimeOut (> 64 min)   total time (min)
handmade     353            48      90        215                    13779
industrial   100            19      33        48                     3160
Table 7.1: MiniSAT 2 results on the SAT 2003 benchmark suite.
7.6.1 Effective Scheduling of SAT Instances
We first consider an upper-bound on resource utilization by executing several problems
concurrently in the ideal case where each benchmark is roughly of the same complexity.
Here, we consider only small benchmarks from the suite previously analyzed and we show
how a multi-threaded machine can effectively be used so that n threads result in approximately
an n-times speed-up. This analysis, shown in Table 7.2, is vital in showing that if
n independent problems are available, the corresponding expected speed-up is indeed possible.
The slight deviation from ideal speedups is due to the variation in runtime demands
from instance to instance. Below, we show our results for solving a set of instances with a






Table 7.2: Running MiniSAT on a set of benchmarks of similar complexity using a varying
number of threads.
Scheduling SAT problems with varying complexity. To evaluate a parallel solving
methodology under a realistic distribution of runtimes, we randomly selected a subset of
benchmarks with a total runtime of ∼32 hours, and with the distribution of Figure 7.2.
In Figure 7.5, we plot the performance of a non-ideal methodology that schedules the
SAT problems as a batch of jobs (batch mode) on an 8-threaded machine. Although the
total runtime for all the problems is approximately four hours, we note that several fast
problems are not scheduled until late in the batch. In particular, small instances tend to
be penalized in their latency. The time-slice mode uses the operating system to
schedule threads. Notice that although several simple instances finish early, the latency for
harder instances increases over batch mode. In our priority mode, we transition to
batch mode by adjusting thread priorities after a time threshold is reached. Notice that the
integral of our priority mode plot is smaller, indicating better overall latency. We achieve
a 20% improvement in average latency over batch mode and a 29% improvement over
time-slice mode. Figure 7.5 shows wall-clock time; however, we have observed that
the system time is insignificant for each strategy (< 2 minutes). This is due, in part, to the
104
Figure 7.5: The number of SAT instances solved (within the time allowed) by considering
three different scheduling schemes for an 8-threaded machine. Our priority
scheme gives the best average latency, which is 20% better than batch mode
and 29% better than time-slice mode.
efficiency of the OS scheduler along with the relatively small memory profile required for
the random slice of 55 instances considered.
7.6.2 Solving Individual Hard Problems
Ultimately, fast verification turn-around may require a faster solution of individual
hard SAT instances. Solvers such as SatZilla [87] try to exploit the fact that some solvers
perform better on certain classes of SAT problems than others. By carefully assigning
different solvers to each instance, one can improve runtime compared to using any one
solver. In the parallel setting, the choice can be simplified by running the solvers
concurrently until one of them completes. However, unlike the single-threaded portfolio
variant, it is desirable that the
improved runtime is comparable to the extra computing resources required. Although
super-linear runtime improvement over the runtime of MiniSAT is possible due to the
high variability of performance of different approaches on a given problem instance, it
is important to achieve consistent improvements by exploiting available computational
resources. In the following analysis, we choose a subset of instances in the suites we
considered where MiniSAT requires significant computation (∼1 hour).
solver portfolio         MiniSAT variants       portfolio w/MiraXT       portfolio w/pMiniSAT
heuristic    # solved    heuristic   # solved   heuristic    # solved    heuristic    # solved
MiniSAT      6           m1          3          MiniSAT      6           pMiniSAT     5
Mira1T       0           m2          2          MiraXT       1           Mira1T       1
Jerusat1.3   1           m4          1          Jerusat1.3   0           Jerusat1.3   1
marchks      0           m5          1          marchks      0           marchks      0
picosat      2           m6          2          picosat      2           picosat      2
rsat         0           m7          1          rsat         2           rsat         1
zchaff       2           m8          1          zchaff       2           zchaff       2
HaifaSat     1           m3          1          -            -           -            -
time (min)   321                     326                     335                      200
speed-up     1.67                    1.65                    1.60                     2.69
%util        20.9                    20.6                    20.0                     33.6
Table 7.3: Hard SAT instances solved using 8 threads of computation with a portfolio of
solvers.
heuristic    # solved    heuristic    # solved
MiniSAT      7           pMiniSAT     8
picosat      2           picosat      2
zchaff       2           zchaff       2




Table 7.4: Hard SAT instances solved using 4 threads of computation with a portfolio of
solvers.
Table 7.3 shows the speed-up achieved by running multiple heuristics simultaneously
Figure 7.6: a) The percentage of satisfiable instances where the first thread that completes
finds a satisfying assignment. b) The standard deviation of runtime between
threads. Using XOR constraints as opposed to splitting one variable can sig-
nificantly improve load balance and more evenly distribute solutions among
threads.
where we consider different solver portfolios. We highlight the improvement of incorporating
our approach in the last two columns. The total runtime without parallelization
for MiniSAT (variant m1 in Table 7.3) is 537 min. The heuristics columns list different
heuristics organized in a portfolio. We report the number of hard instances that a particular
heuristic solves the fastest. The first column shows a collection of state-of-the-art
SAT solvers. Notice that the speed-up on 8 cores is fairly small at 1.7, meaning that only
20.9% of the 8-times ideal speed-up is realized. The third column shows a portfolio of
different variants of MiniSAT given by m# produced by adjusting several tunable knobs
such as: restart frequency, variable decay rate, and decision heuristic. These results reveal
similarly poor utilization where neither randomness nor different heuristics achieve
high utilization. We then tried running MiraXT [55] with two threads but did not see additional
speed-up in the portfolio (one heuristic is removed from the original portfolio to
account for the extra thread required by MiraXT). Because its performance is dominated
by MiniSAT, parallelizing this solver is ineffective at increasing utilization. Furthermore,
the results reported in [55] consider only 2 threads with speed-up much less than 2. Additionally,
we have observed that their heavyweight approach for partitioning and learning
experiences diminishing returns when considering more threads.
By incorporating our parallel version of MiniSAT, pMiniSAT, discussed in Section
7.5 in the solver portfolio, we are able to achieve significant speed-up and 60.5% higher
utilization of the 8 threads of execution compared to the best solver portfolio
(pMiniSAT also requires 2 threads). Furthermore, in Table 7.4, we show that our utilization
is even better when considering only 4 threads. This indicates the limitation of large solver
portfolios, illustrating that our lightweight approach for parallelization can be beneficial
for achieving greater utilization by applying it across multiple heuristics.
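The speed-up and utilization rows of Table 7.3, and the 60.5% figure, follow from the reported runtimes by simple arithmetic. The sketch below recomputes them; the helper names are introduced here, and reading the 60.5% as a relative gain over the first portfolio is an interpretation that matches the table values.

```python
SEQUENTIAL_MIN = 537      # MiniSAT without parallelization, as stated in the text
THREADS = 8
portfolio_min = {         # "time (min)" row of Table 7.3
    "solver portfolio": 321,
    "MiniSAT variants": 326,
    "portfolio w/MiraXT": 335,
    "portfolio w/pMiniSAT": 200,
}

def speed_up(minutes):
    return SEQUENTIAL_MIN / minutes

def utilization(minutes):
    """Percentage of the ideal THREADS-times speed-up that is realized."""
    return 100 * speed_up(minutes) / THREADS

for name, minutes in portfolio_min.items():
    print(f"{name:22s} speed-up {speed_up(minutes):.2f}  util {utilization(minutes):.1f}%")

best_plain = utilization(portfolio_min["solver portfolio"])
with_pminisat = utilization(portfolio_min["portfolio w/pMiniSAT"])
relative_gain = 100 * (with_pminisat / best_plain - 1)
print(f"relative utilization gain: {relative_gain:.1f}%")
```

Because both utilizations share the same sequential baseline and thread count, the relative gain reduces to the ratio of the two portfolio runtimes, 321/200.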
7.6.3 Partitioning Strategies
We compared our XOR-based partitioning to a partitioning strategy with a single guiding
variable, a special case of guiding paths [89]. In Figure 7.6, we show the effectiveness
of using XOR constraints for achieving balanced workloads among threads. Figure 7.6a
shows the percentage of satisfiable problem instances (out of 16 instances), where the first
thread that completes delivers at least one solution. We compare a single variable partitioning
strategy against XOR constraints of size 2 to 4 and consider parallelization using
2, 4, and 8 concurrent threads. Note that, in the 2-thread case, 100% of the threads that
finish first are satisfiable using XORs of size 4, compared to only 75% using one variable.
In general, this experiment reveals that our partitioning is more effective at distributing
solutions. We expect even better performance in application domains where the number


























































































Figure 7.7: The effectiveness of sharing learnt clauses by choosing the most active learnt
clauses compared to the smallest learnt clauses.
Figure 7.6b shows the runtime balance between 2, 4, and 8 threads. We examined different
partitioning strategies on a set of 29 unsatisfiable problem instances and calculated
the standard deviation of thread runtime divided by average runtime. We disabled learning
for this experiment to analyze more accurately how the search space is partitioned. For
the single variable partitioning for two threads, the normalized standard deviation is 0.35,
compared to a much smaller 0.22 for XOR-based partitioning with 4 variables. In general,
we observe almost a 2-times improvement in the runtime deviation between the single-variable
strategy and the 4-variable XOR when considering different numbers of threads.
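The balance metric used here can be written as a short helper. Whether the population or the sample standard deviation was used is not stated, so the population form below is an assumption, and the runtimes are hypothetical.

```python
from statistics import mean, pstdev

def normalized_deviation(thread_runtimes):
    """Standard deviation of per-thread runtime divided by the average runtime."""
    return pstdev(thread_runtimes) / mean(thread_runtimes)

balanced = [12, 12, 12, 12]       # hypothetical per-thread runtimes (minutes)
skewed = [30, 10]                 # one thread does three times the work

print(normalized_deviation(balanced), normalized_deviation(skewed))
```

A value of 0 means every thread ran for the same time; larger values indicate that some threads finished early and sat idle while others kept working.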
7.6.4 Parallel Learning Strategies
We note that efforts in previous parallel learning strategies focus on minimizing com-
munication and subsequently favoring small learnt clauses. The work in [55] incorporates
all learnt clauses within a size threshold. However, according to [90], the size of the clause
is not the best indicator for its effectiveness. We consider utilizing VSIDS to choose learnt
clauses that more effectively prune the search space relevant to the current sub-problem
being solved. We show our results in Figure 7.7 by comparing two different strategies
for sharing learnt clauses between 4 SAT solvers executing in parallel. Each SAT solver
chooses available learnt clauses ranked either by size as in [22] or by our strategy, which
uses the activity of the learnt clauses. We notice that this enhancement results in improve-
ments for most of the benchmarks considered.
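The two ranking policies can be contrasted with a small sketch. The clause representation (a pair of literals and an activity score) and the function names are hypothetical; a real solver would maintain clause activity internally and bump it during conflict analysis.

```python
def share_by_size(learnts, k):
    """Baseline policy: export the k shortest learnt clauses."""
    return sorted(learnts, key=lambda clause: len(clause[0]))[:k]

def share_by_activity(learnts, k):
    """Alternative policy: export the k learnt clauses with the highest activity."""
    return sorted(learnts, key=lambda clause: clause[1], reverse=True)[:k]

# Hypothetical learnt clauses as (literals, activity) pairs.
learnts = [((1, -2), 0.1), ((3, 4, -5, 6), 2.5), ((-1,), 0.4), ((2, -3, 5), 1.9)]

print(share_by_size(learnts, 2))
print(share_by_activity(learnts, 2))
```

On this toy input the two policies select disjoint sets: the short clauses have seen little recent use, while the most active clauses are longer, which is the situation in which ranking by activity would be expected to help.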
7.7 Concluding Remarks
The computational complexity of SAT solving along with the runtime variability ex-
hibited between different solver heuristics challenges state-of-the-art parallel algorithms.
We proposed a two-part strategy for exploiting parallel processing more effectively, so
that more powerful SAT-based optimizations become practical. First, we introduced a
scheduling algorithm that incorporates the approximate knowledge of runtime distributions
for a given set of SAT instances to reduce average latency relative to batch scheduling
by 20%. Since several instances require prohibitive amounts of runtime, we also proposed
a lightweight parallel SAT algorithm that effectively partitions the search space after first
exploring part of the search space sequentially. We observethat our partitioning results in
∼ 50% better run-time balance than simply choosing one splitting variable. Our strategy
enables us to improve resource utilization over solver portfolios by 60.5%. By incorpo-
rating our partitioning strategy with different SAT solvers, solver portfolios can be further




Improving Logic and Physical Synthesis
In the previous chapters, we have introduced strategies to improve bit signatures’ ability to
distinguish functionally different nodes and to verify the correctness of abstractions more
efficiently. We now leverage these advances to enable powerful logic optimizations guided
by signatures. First, in Chapter VIII we introduce novel logic transformations which would
require prohibitive amounts of computation without using signatures. Then, in Chapter IX
we use these novel transformations to enable powerful optimizations in post-placement




Bit signatures provide an effective means to approximate synthesis transformations. In
addition, the ability to encode don’t-care information in the signatures enables more op-
timization opportunities for the transformations considered. In this chapter, we introduce
two general techniques for using signatures to enable powerful optimizations. We first
describe a node-merging strategy that uses ODCs and achieves area reductions of 25%
on average. Then, we discuss a goal-driven synthesis technique, distinct from other logic
synthesis approaches, that can efficiently determine whether a logic implementation exists
for a topology corresponding to a desired subcircuit. In the next chapter, we leverage both
of these synthesis strategies to restructure critical paths after placement.
8.1 Logic Transformations through Signature Manipulations
Algorithms for logic synthesis typically operate on some representation of Boolean
functions that represent circuit nodes: algebraic expressions, sums-of-products and other
Boolean formulas, such as BDDs, AIGs, etc. After logic synthesis, these representations
are converted back to circuits. To justify such manipulations by proxy, one has to ensure
that any circuit-based operation is faithfully represented by its counterpart on a given rep-
resentation. To demonstrate that signature-based abstractions satisfy this condition (where
operations on the signatures correspond to operations on the actual circuit), we formally
denote the assignment of a Boolean function F to a value of 0 or 1 by the homomorphism
eval_X, for an input X. eval_X gives a mapping of the Boolean function space for an input
vector to 0 or 1, i.e., $2^{2^{|X|}} \to \{0,1\}$, such that

$$\mathrm{eval}_X(F \cdot G) = \mathrm{eval}_X(F) \circ \mathrm{eval}_X(G) \qquad (8.1)$$

The symbol · denotes any Boolean operation in the Boolean function space, $2^{|X|} \to \{0,1\}$,
and the symbol ◦ denotes the corresponding bit operation.
For example, if · is the Boolean AND operation ∧, then ◦ is the bit-wise AND operation
&. The relation in Equation 8.1 indicates that, for any input vector, evaluating the output of
a Boolean function (composed of Boolean functions F and G) is equivalent to evaluating
the outputs of F and G and applying the corresponding bit operation. By extending this
relation on one input vector to K input vectors, we produce the mapping
$2^{2^{|X|}} \to \{0,1\}^K$, which is the signature of a function. Therefore, manipulating the
signatures of F and G, S_F ◦ S_G, is equivalent to generating a signature of F · G. The
resynthesis of H with inputs F and G corresponds to the generation of S_H from S_F and S_G.
Example 2 For nodes H, F, and G, assume S_H = {0,1,1,0}, S_F = {1,1,1,0}, and S_G =
{0,1,1,0} under 4 simulation vectors. Then S_F & S_G = S_H = {0,1,1,0}, where & is a bitwise
AND. If eval_X(H) = eval_X(F) & eval_X(G) for all input vectors, then H = F ∧ G.
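The correspondence between Boolean operations and bitwise operations on signatures can be illustrated with a short sketch. The functions F (NAND) and G (XOR) below are hypothetical choices made only so that their signatures reproduce the values of Example 2; the helper names are likewise illustrative.

```python
from itertools import product


def signature(fn, vectors):
    """Pack the outputs of fn over all simulation vectors into one integer;
    bit k holds the output for vectors[k]."""
    sig = 0
    for k, vec in enumerate(vectors):
        sig |= (fn(*vec) & 1) << k
    return sig


def to_bits(sig, num_vectors):
    """Unpack a signature into a list of bits, first vector first."""
    return [(sig >> k) & 1 for k in range(num_vectors)]


# Four simulation vectors over two inputs: (0,0), (0,1), (1,0), (1,1).
VECTORS = list(product((0, 1), repeat=2))

# Hypothetical node functions chosen to match the signatures in Example 2.
F = lambda x, y: 1 - (x & y)          # NAND
G = lambda x, y: x ^ y                # XOR
H = lambda x, y: F(x, y) & G(x, y)    # H = F AND G

S_F = signature(F, VECTORS)
S_G = signature(G, VECTORS)
S_H = signature(H, VECTORS)

# Simulating H directly agrees with combining the signatures of F and G.
print(to_bits(S_H, 4), to_bits(S_F & S_G, 4))
```

Packing a signature into a machine word (or a Python integer) is what allows one bitwise instruction to evaluate a gate on many simulation vectors at once.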
8.2 ODC-enhanced Node Merging
Merging equivalent circuit nodes is an effective technique to reduce the area of a logic
circuit. It scales to very large netlists, but, unlike BDD-based techniques to determine
equivalence, it requires non-trivial algorithms to identify potential mergers and verify the
results. Such algorithms for node merging were first developed in the context of formal
verification to detect possible cut-points in equivalence checking [34, 57]. To this end,
the work in [47, 59] uses a combination of SAT solving and simulation. Candidate nodes
for merging are first selected by checking whether their outputs correspond when stimulated
with random patterns applied to the design’s inputs. Then, their actual equivalence
can be verified using SAT. The simulation is refined through counterexamples generated
by SAT, which reduces the number of checks resulting in non-equivalence. Rather than
finding equivalent nodes as a post-processing step, the work in [59] improves equivalence
checking by merging equivalent nodes while constructing the mitered circuit. However,
incremental approaches, such as [59], do not allow for the detection of ODCs because
no information about the downstream logic is maintained. We show that by taking
ODCs into account, additional node mergers become possible.
Because of the computational complexity involved in deriving ODCs, previous work
[95] tends to emphasize local computation as a synthesis optimization before technology
mapping. This emphasis is well justified for AIGs, which have a much larger number of
internal nodes, and thus possible mergers, compared to mapped circuits. However, our
intended applications are in physical synthesis, where technology mapping can significantly
affect circuit delay, and the placement of standard cells is crucial. In this context, fewer
nodes are exposed, and one must search for additional don’t-cares not found by existing
techniques. Thus, our goal is to quickly identify nodes equivalent up to global don’t-cares,
efficiently verify their equivalence, and use the results to simplify the design structure.
Additionally, our implementation can operate on mapped designs without requiring costly
netlist conversions, which otherwise lead to a loss in physical information and delay
estimates. In the following, we first explain how merger candidates can be identified using
logic signatures and then provide empirical results.
8.2.1 Identifying ODC-based Node Mergers
In this section, we develop the theory involved in ODC-based node merging and describe
the use of signatures to identify candidate mergers.
ODC-substitutability. Traditionally, a node merger can occur between node a and
node b when they are functionally equivalent. We define node mergers between a and b in
the presence of ODCs when node a is ODC-substitutable to node b.
Definition 8.2.1 Node a is ODC-substitutable to node b if ONSET(a) ∪ ODC(b) =
ONSET(b) ∪ ODC(b).
When a is ODC-substitutable to b, a merger between a and b means that a can be
substituted for b. Because the ODCs of only one node are considered, ODC-substitutability
is not symmetric, as b might not be ODC-substitutable to a.
Using signatures and ODC-masks described in the previous chapter, we can define a
candidate merger as follows:
Definition 8.2.2 Node a is a candidate for ODC-substitutability with node b if and only if
$S^{lo}_b \subseteq S_a \subseteq S^{hi}_b$, denoted $S_a \in [S^{lo}_b, S^{hi}_b]$; in other words, S_a is contained
within the range of signatures defined by $S^{lo}_b$ and $S^{hi}_b$,
where the ⊆ relation is defined using the signatures of two nodes:
Definition 8.2.3 S_b ⊆ S_a if and only if S_b | S_a = S_a, where | represents bit-wise OR.
circuit #candidates %incorrect %missed
(false positives) (false negatives)
ac97ctrl 63758 0.0 0.0
aescore 315917 0.1 0.0
desperf 296095 0.0 0.0
ethernet 8852009 0.3 0.8
memctrl 867145 1.0 1.4
pci bridge32 1158654 0.2 0.4
spi 156291 0.0 3.1
systemcaes 285189 0.2 0.2
systemcdes 5288 2.8 0.7
tv80 1348277 1.5 9.0
usb funct 1685374 2.2 1.8
wb conmax 1904773 0.0 0.0
Table 8.1: Evaluation of our approximate ODC simulator in finding node merger candidates:
we show the total number of candidates after generating 2048 random
input patterns and report the percentage of false positives and negatives.
Therefore, by simple application of S*_b, it can be determined whether a is an ODC-substitutable
candidate with b. Similar to Definition 8.2.1, if a is an ODC-substitutable candidate with
b, it does not imply that b is an ODC-substitutable candidate with a.
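The candidate test can be sketched with integers as packed signatures. The construction of the bounds is defined in the previous chapter and is not shown here, so the sketch assumes that the mask S* has a 1 wherever a bit is observable (a care bit), that the lower bound is S & S*, and that the upper bound is S | ~S*. All function names are illustrative.

```python
def odc_bounds(sig, care_mask, num_bits):
    """Assumed construction: S_lo = S & S*, S_hi = S | ~S* (truncated to num_bits)."""
    full = (1 << num_bits) - 1
    s_lo = sig & care_mask
    s_hi = (sig | ~care_mask) & full
    return s_lo, s_hi


def contained(s_small, s_big):
    """Definition 8.2.3: S_small is contained in S_big iff S_small | S_big == S_big."""
    return (s_small | s_big) == s_big


def is_candidate(s_a, s_b, care_mask_b, num_bits):
    """Definition 8.2.2: a is a candidate for b iff S_lo_b <= S_a <= S_hi_b (as bit sets)."""
    s_lo, s_hi = odc_bounds(s_b, care_mask_b, num_bits)
    return contained(s_lo, s_a) and contained(s_a, s_hi)


# Hypothetical 4-bit signatures: bit 2 of node b is a don't-care.
S_A, S_B = 0b0010, 0b0110
MASK_B = 0b1011   # care mask of b (assumed convention: 1 = observable)
MASK_A = 0b1111   # node a has no don't-cares in this example

print(is_candidate(S_A, S_B, MASK_B, 4))  # a can be a candidate to replace b
print(is_candidate(S_B, S_A, MASK_A, 4))  # the relation is not symmetric
```

A candidate found this way is only a hypothesis: as in the rest of the chapter, it still has to be validated by the SAT-based check before a merger is applied.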
The approximate ODC analysis is capable of finding many candidates while filtering
out false positives or negatives in the ODC mask due to the approximation of the simulator.
Table 8.1 shows the number of ODC-substitutability candidates for all nodes in the circuit
identified by our approximate simulator and the percentage of incorrect candidates due
to false positives (%incorrect) in the ODC mask and missed due to false negatives
(%missed). In the experiment, we generated 2048 random input patterns to extract the
candidates. The results indicate that several candidates exist and that the number of false
positives and negatives is typically only a small fraction of the opportunities identified.
Finding candidates with signatures. Constant-time complexity hashing, as in [59],
cannot be used to identify ODC-substitutability candidates. Here, each node needs to apply
its mask to every other node to find potential candidates. The result is that for N nodes,
finding all ODC-substitutability candidates for a design requires O(N^2 K)-time complexity,
assuming that applying a mask is an O(K)-time operation. Thus, we developed a strategy
that significantly reduces computation in practice. First, all of the signatures, S, in the
design are sorted by the value obtained by treating each K-bit signature as a single K-bit
number. This operation requires O(NK log N) time. Then, for a given node c, candidates
can be found by performing two binary searches with $S^{lo}_c$ and $S^{hi}_c$ to obtain a lower and
upper bound on the sorted S, an O(K log N)-time operation. Searching for complemented
candidates can be accomplished by simply complementing $S^{lo}_c$ and using this to derive an
upper bound. Similarly, $S^{hi}_c$ must also be complemented and used to derive a lower bound.
The following equation defines the set of signatures S_x that is checked for candidacy (we
ignore the case of negation for simplicity):

$$\bigcup_x S_x \quad \text{if } num(S^{lo}_c) \le num(S_x) \le num(S^{hi}_c) \qquad (8.2)$$

where num represents the K-bit value of the signature. This set is traversed linearly to find
candidates according to Definition 8.2.2.
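The sort-then-search strategy can be sketched as follows. Signatures are held as Python integers, so num(S) is simply the integer value; the node names and signature values are hypothetical, and the complemented-candidate case is omitted as in Equation 8.2.

```python
from bisect import bisect_left, bisect_right


def contained(s_small, s_big):
    """S_small is contained in S_big iff S_small | S_big == S_big."""
    return (s_small | s_big) == s_big


def build_index(signatures):
    """Sort nodes by the numeric value of their signatures (done once per design)."""
    ordered = sorted(signatures.items(), key=lambda item: item[1])
    keys = [sig for _, sig in ordered]
    names = [name for name, _ in ordered]
    return keys, names


def find_candidates(index, s_lo, s_hi):
    """Two binary searches bound the range num(S_lo) <= num(S_x) <= num(S_hi);
    the range is then traversed linearly and filtered by containment."""
    keys, names = index
    start = bisect_left(keys, s_lo)
    stop = bisect_right(keys, s_hi)
    return [names[i] for i in range(start, stop)
            if contained(s_lo, keys[i]) and contained(keys[i], s_hi)]


# Hypothetical 4-bit signatures for five nodes.
SIGNATURES = {"a": 0b0010, "b": 0b0110, "c": 0b0100, "d": 0b0011, "e": 0b1000}
INDEX = build_index(SIGNATURES)

print(find_candidates(INDEX, 0b0010, 0b0110))
```

The numeric range is only a necessary condition: containment implies the numeric ordering, but not the reverse, which is why the linear pass over the range is still required.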
8.2.2 Empirical Validation
Experimental setup. We developed our solution and relied on a specialized SAT
engine based on MiniSAT for validating candidates. We used random simulation patterns
to generate the initial ODC signatures. We used testbench circuits from the IWLS 2005
suite [102]. Our experiments were run on a Pentium-4 3.2 GHz machine. The ODC-based
node-merging algorithm examined each node in a circuit in one topological traversal. Each time
a merger is applied, the signatures in the fanout cone of the replaced node could become
inaccurate, due to different don’t-care sets. However, since signatures are only used to find
candidates to be validated by a SAT solver, incorrect signatures can never lead to incorrect
mergers and updates are thus not necessary.
For the experiments on combinational simulation and equivalence checking, we extract
the combinational portion of the IWLS 2005 testbenches. In the experiment, every internal
node with a non-empty ODC-set is examined for merging opportunities; however, we
ignore mergers that increase the number of logic levels in the design. After completing
the analysis, we check the correctness of the transformations using ABC’s equivalence
checking tool [98].
Post-synthesis optimization. In this section, we show that our global ODC analysis
discovers node mergers even after synthesis optimizations [62, 98]. These additional
reductions can be easily performed in conjunction with layout information to help achieve
design closure.
To create a realistic experimental setup, we first optimized the netlist of each circuit
by running a synthesis optimization phase in ABC [98], which further compressed the
designs (the original netlist was mapped to a barebone set of logic gates).¹ The results
of this evaluation are reported in Table 8.2. The first column, #gates, gives the number
of gates in each design after synthesis with ABC. The second column gives the synthesis
optimization runtimes with the resyn2 script. We then report the number of ODC-based
mergers that we detected and applied, and the corresponding reduction in area. The final
column gives the additional runtime required by our merger algorithm. We set a timeout
of 5000s for the merger algorithm: for a few testbenches we reached this limit and report
¹ We used the resyn2 script in the ABC package, which performs local circuit rewriting optimization
[62].
circuit #gates ABC(s) #merge %area reduct mergers(s)
dalu 1054 0 91 12.0% 10
i2c 1055 0 30 3.2% 3
pci spoci ctrl 1058 0 97 9.2% 6
C5315 1368 0 8 0.7% 2
C7552 1541 1 25 3.4% 8
s9234 1560 0 10 1.2% 8
i10 1884 1 38 1.3% 12
alu4 2559 1 469 22.9% 64
systemcdes 2655 1 111 4.7% 9
s13207 2725 1 15 1.8% 17
spi 3342 1 23 1.3% 84
tv80 8279 3 606 7.1% 1445
s38417 9499 2 33 1.0% 275
systemcaes 10093 4 518 3.8% 360
s38584 11306 2 150 0.8% 223
memctrl 12192 5 1797 18.0% 738
ac97ctrl 13178 3 185 2.0% 188
usb funct 15514 5 186 1.4% 681
pci bridge32 19872 6 82 0.1% 1134
aescore 21957 9 2144 8.6% 1620
b17 24947 6 224 1.6% 5000
wb conmax 49236 19 2433 6.2% 5000
ethernet 67129 28 45 1.4% 5000
desperf 80218 50 3148 3.7% 5000
average 4.9%
Table 8.2: Area reductions achieved by applying the ODC merging algorithm after ABC’s
synthesis optimization [62]. The time-out for the algorithm was set to 5000
seconds.
the improvements achieved within this time. Despite the ABC-based pre-optimization, we
observe that the designs can still be further optimized with improvements of over 10% in
some cases.
Table 8.3 reports potential mergers when using don’t-cares. For this experiment the
netlists were generated by Synopsys DesignCompiler [104]. The circuits were synthesized
with high effort and the results were mapped using the generic GTECH library. Columns
DC(s) and odc(s) give the runtime for running DesignCompiler and the node merging
algorithm, respectively. The runtime overhead of node merging is shown by %overhead
and it is small for most testbenches. The final two columns give the number of mergers
produced and the percentage of gates eliminated. The results indicate that even after state-
of-the-art synthesis, our node-merging application, which is a special case of our more
general proposed strategy, allows for additional area reductions in many circuits.
ODC locality. We now show that several levels of downstream logic are often involved
in proving equivalence with ODCs. Because of our efficient simulation and incremental
verification technique, we can enhance the local ODC analysis of [95] by considering node
mergers of unbounded depth.
In Table 8.4, we compare the percentage of mergers exposed using K levels of down-
stream logic, for K=1..5, against using unbounded K. Circuits were optimized as in the
previous experiment. The results indicate that most mergers can be detected using only a
few levels of logic. However, on average, our solution can detect 25% more mergers by
not limiting the depth of logic under consideration.
To evaluate the impact of circuit unrolling on merging opportunities, we devised a specific
experiment. Circuit unrolling is a key step in bounded model checking and in finding
circuit #gates DC(s) odc(s) %overhead #merge %gate reduct
pci spoci ctrl 281 15 0 0 5 2.5
dalu 315 11 2 18.2 3 1.0
s9234 375 23 1 4.3 0 0.5
systemcdes 437 33 0 0 9 2.5
s13207 487 44 1 2.3 3 1.0
i2c 544 17 1 5.9 8 1.8
alu4 806 18 6 33.3 23 4.1
spi 821 44 2 4.5 4 0.7
C5315 828 14 2 14.3 6 0.7
C7552 1046 17 2 11.8 24 2.4
i10 1185 18 4 22.2 17 1.5
aescore 1758 293 3 1 29 1.8
tv80 1953 135 15 11.1 16 1.1
pci bridge32 2079 488 23 4.7 18 1.0
ac97ctrl 2119 284 12 4.2 35 1.7
systemcaes 2175 135 10 7.4 10 0.6
mem ctrl 2560 258 23 8.9 19 0.8
s38417 2578 236 36 15.3 28 1.2
s38584 3922 207 20 9.7 69 1.8
ethernet 4163 3053 47 1.5 25 0.6
usb funct 4718 293 44 15 36 0.8
wb conmax 9833 885 203 22.9 122 1.3
b17 11133 1041 343 33.0 87 0.8
desperf 12685 4719 216 4.6 255 2.1
average 10.7 1.4
Table 8.3: Gate reductions and performance cost of the ODC-enhanced node-merging
algorithm when applied to circuits synthesized with DesignCompiler [104] in
high-effort mode. The merging algorithm runtime is bounded to 1/3 of the corresponding
runtime in DesignCompiler.
sequential don’t-care opportunities in physical synthesis. This motivated us to investigate
whether additional netlist compression opportunities were available for unrolled circuits. We
expect unrolled circuits to have higher potential for node merging because of the larger
amount of combinational logic available. In the experiment, we considered a range of
sequential designs and unrolled them between 1 and 5 times; then, for each scenario, we
compared the percentage of mergers discovered by considering only five levels of logic
circuit K=1 K=2 K=3 K=4 K=5 K=∞
dalu 9.9 14.3 19.8 31.9 38.5 100
i2c 36.7 53.3 60.0 66.7 80.0 100
pci spoci ctrl 21.6 51.5 67.0 84.5 93.8 100
C5315 87.5 87.5 87.5 87.5 87.5 100
C7552 36.0 64.0 64.0 68.0 72.0 100
s9234 0 0 20.0 20.0 40.0 100
i10 15.8 28.9 60.5 71.1 86.8 100
alu4 13.2 26.9 35.2 42.6 50.1 100
systemcdes 26.1 38.7 60.4 74.8 86.5 100
s13207 13.3 46.7 60.0 80 93.3 100
spi 60.9 82.6 91.3 95.7 100 100
tv80 11.9 23.4 38 49 56.3 100
s38417 12.1 54.5 78.8 100 100 100
systemcaes 21.6 45.8 70.5 72.8 73.9 100
s38584 17.3 55.3 70.7 82.0 85.3 100
memctrl 26.5 43.0 55.4 68.3 77.0 100
ac97ctrl 63.2 88.1 93.5 96.8 97.8 100
usb funct 42.5 69.4 81.7 87.6 91.4 100
pci bridge32 45.1 54.9 68.3 78.0 87.8 100
aescore 9.7 15.4 22.9 31.6 42.3 100
b17 21.4 30.4 35.7 42.4 44.2 100
wb conmax 7.9 16.5 26.0 36.5 48.5 100
ethernet 31.1 48.9 68.9 77.8 84.4 100
desperf 16.8 27.4 39.4 55.7 74.0 100
average 27.0 44.5 57.3 66.7 74.6 100
Table 8.4: Percentage of mergers that can be detected by considering only K levels of logic,
for various K.
circuit          unrolling depth
                 1     2     3     4     5
i2c              80.0  57.0  42.8  43.1  43.2
pci spoci ctrl   93.8  87.8  86.5  84.8  84.5
s9234            40.0  51.4  42.0  38.2  42.9
systemcdes       86.5  85.3  88.7  86.2  86.3
spi              100   70.7  71.7  64.6  67.5
ac97ctrl         97.8  83.2  64.2  46.6  38.9
average          83.0  72.6  66.0  60.6  60.6
Table 8.5: Comparison with circuit unrolling. Percentage of total mergers exposed by the
local ODC algorithm (K=5) for varying unrolling depths.
versus considering the whole unrolled netlist. As shown in Table 8.5, for a few of the
circuits, the percentage of mergers missed by local ODC computation is highly affected
by the unrolling depth: the more the circuit is unrolled, the higher the missed fraction.
An example is ac97ctrl where, with no unrolling, only 2% of the mergers are missed;
however, with an unrolling depth of 5 the missed percentage becomes 60%. On one hand,
the local analysis has better performance (we could not show the full range of results for
all designs because of timeout conditions). On the other hand, our solution presents better
flexibility to adjust to a wide range of design sizes.
Framework assessment. Table 8.6 shows the quality of our signature-based framework
on unoptimized circuits by assessing the effectiveness of signatures in finding good
merger candidates. #merge gives the number of mergers applied to a circuit. We then
show the number of SAT calls required to prove the correctness of each merger, along with
the corresponding percentage of those calls that confirmed equivalence (columns #SAT
and %equiv). Merger candidates that required over 10 seconds to be verified were timed
out, so as to favor faster mergers. The column #dyn-sim denotes the number of dynamic
simulation vectors derived from counterexamples provided by the SAT-based verification
engine. The final column, #prune, shows how many SAT calls were pruned because of the
inclusion of the dynamic vectors.
The results indicate that, on average, almost 50% of the SAT calls result in ODC
merging. Moreover, it is clear that the use of dynamic simulation vectors had a great
impact on this high-quality result. Reducing the number of SAT calls is key because SAT-
based equivalence checks contribute to most of the runtime cost. Furthermore, the dynamic
vectors added are typically much fewer than the number of false positives pruned.
circuit #merge #SAT %equiv #dyn-sim #prune
i2c 39 206 18.9% 167 36960
pci spoci ctrl 170 472 36% 302 34345
alu4 697 1306 53.4% 609 273497
dalu 636 1040 61.2% 404 25808
i10 257 580 44.3% 323 22029
spi 112 557 20.1% 445 78721
systemcdes 255 287 88.9% 32 153
C5315 161 192 83.9% 31 194
C7552 340 524 64.9% 184 107665
s9234 821 1959 41.9% 1138 514875
tv80 658 1781 36.9% 1117 832861
systemcaes 658 750 87.7% 88 8852
s13207 300 1007 29.8% 707 2208345
ac97ctrl 80 256 31.3% 176 26803
mem ctrl 2758 4356 63.3% 1580 2710618
usb funct 246 1739 14.1% 1493 1206172
pci bridge32 158 1189 13.3% 1031 2951017
s38584 2253 3610 62.4% 1357 3487613
aescore 2072 2317 89.4% 245 2205
s38417 636 2416 26.3% 1780 11544973
wb conmax 2313 5068 45.6% 2755 441002
b17 614 3588 17.1% 2974 21984143
ethernet 370 2084 17.8% 1509 2979472
desperf 2505 2614 95.8% 109 1198
average 47.7%
Table 8.6: Statistics for the ODC merging algorithm on unsynthesized circuits. The table
reports the SAT success rate in validating merger candidates and the number
of SAT calls that could be avoided because of the use of dynamic simulation
vectors.
8.3 Determining Logic Feasibility with Signatures
In the previous section, we showed that finding equivalent nodes in a circuit may lead
to significant area reductions. In this section, we introduce a goal-driven synthesis strategy
that can efficiently find a gate-level logic implementation for a given function. The
strategy can be applied in logic resynthesis to transform circuit blocks so as to optimize one
or more physical parameters (e.g., area, timing, etc.). In the next chapter, we will apply this
technique in building circuit structures optimized for timing delay. To express our goal
more formally, we assume a subcircuit with m inputs, {a_1, a_2, ..., a_m}, and output F
to resynthesize, and we want to find several restructuring solutions that we can then
evaluate based on their parameters. We represent the input subcircuit as a directed graph T_F
with m incoming edges, one outgoing edge F, and n internal vertices. Our goal is to
determine whether there is a labeling, G*, of the n vertices with gates g ∈ G, such that F is logically
equivalent to the subcircuit that implements T_F, with respect to the outputs of the circuit.
In defining the data structures necessary to achieve our goal, we leverage a few previous
works. In particular, in [93], sets of pairs of functions to be distinguished (SPFDs) are
introduced as a way of representing a node’s functionality which can be used to exploit
circuit flexibility in logic optimization. In [77], the authors propose a technique that uses
SPFDs to find a logic implementation given a topological constraint, but their resynthesis
approach does not incorporate physical parameters such as timing and is limited to only
a few neighboring levels of logic to reduce the memory and computational requirements
of SPFDs. In an alternative strategy to reduce the memory requirements of SPFDs, the
authors in [94] choose a subset of SPFDs for a node using simulation and compatibility
don’t-cares in a logic rewriting application.
Because we efficiently encode global circuit don’t-cares, we are not limited by levels
of logic or required to have don’t-cares that are compatible. Furthermore, our approach
encodes the distinguishing bits in a compact data structure with logic signatures so that
these operations can also be performed with bitwise parallelism. This is particularly
beneficial in our development of a novel goal-driven synthesis technique where fast evaluation
of topological constraints is essential to tightly couple physical optimization and logic
synthesis.
We now define several properties for the graph T_F that is the input to our strategy. We
define the logic feasibility of the graph T_F as:
Definition 8.3.1 TF is logically feasible if∃G∗ONSET(TFc) = ONSET(F).
where ONSET is the set of input combinations for which the subcircuit outputs 1.
This definition can be relaxed by considering its relation within the care-set, which
could be considerably smaller than 2^m, due to controllability and observability don’t-cares.
Definition 8.3.2 TF is logically feasible up to circuit don’t-cares if∃G∗ONSET(TFc)∪
DC(F) = ONSET(F)∪DC(F).
where DC is the don’t-care set.
A naïve algorithm for determining the logic feasibility of T_F requires that every possible
labeling G* is evaluated. For n vertices, this requires checking |G|^n labelings. If the set
of two-input logic functions is considered, there are 5^n labelings.² Furthermore, performing
equivalence checking between ckt(T_F) and T_F is an NP-complete problem. Below, we
discuss how signatures can be used to determine a minimal set of inputs that implements
a given function and how this can be extended to quickly determine logic feasibility up to
the signature approximation.
² Although there are 16 different functions in the two-input Boolean function space over a switching
algebra, the tautology and two one-variable identity functions along with the negated form of each function
do not need to be explicitly considered.
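The naïve enumeration can be sketched at the signature level for a small fanout-free tree. This is an illustration under stated assumptions, not the implementation used in this work: the five gate representatives, the nested-tuple encoding of the topology, and the convention that a match up to output complement counts as feasible are all choices made for the sketch. The check is only as strong as the signatures; it does not replace formal equivalence checking.

```python
from itertools import product

K = 4                       # number of simulation vectors (assumed)
MASK = (1 << K) - 1

# One representative per non-degenerate two-input function, up to output negation.
GATES = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "A_ANDNOT_B": lambda a, b: a & ~b & MASK,
    "NOTA_AND_B": lambda a, b: ~a & b & MASK,
}


def sig(bits):
    """Pack a list of bits into an integer; bits[0] is the first simulation vector."""
    value = 0
    for i, bit in enumerate(bits):
        value |= bit << i
    return value


def internal_nodes(node):
    """Names of internal vertices; a leaf is an input name (str), an internal
    vertex is a tuple (name, left_child, right_child)."""
    if isinstance(node, str):
        return []
    name, left, right = node
    return internal_nodes(left) + internal_nodes(right) + [name]


def evaluate(node, labeling, inputs):
    """Signature at the output of `node` under a gate labeling."""
    if isinstance(node, str):
        return inputs[node]
    name, left, right = node
    return GATES[labeling[name]](evaluate(left, labeling, inputs),
                                 evaluate(right, labeling, inputs))


def naive_feasible(tree, inputs, target):
    """Try all |G|^n labelings; accept a match up to output complement."""
    names = internal_nodes(tree)
    for choice in product(GATES, repeat=len(names)):
        labeling = dict(zip(names, choice))
        out = evaluate(tree, labeling, inputs)
        if out == target or out == (~target & MASK):
            return labeling
    return None


# Hypothetical signatures and topology for illustration.
INPUTS = {"S1": sig([0, 0, 0, 1]), "S2": sig([0, 1, 0, 1]), "S3": sig([0, 1, 1, 1])}
TARGET = sig([0, 0, 1, 1])
TREE = ("root", "S3", ("n1", "S1", "S2"))

print(naive_feasible(TREE, INPUTS, TARGET))
```

Even for this three-input tree the loop may visit up to 25 labelings, which is the exponential growth that the signature-based analysis below is meant to avoid.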
Pairs of bits to be distinguished.
Definition 8.3.3 A function F is said to be dependent on an input a_i if $F_{a_i=0} \oplus F_{a_i=1} \neq 0$.
A similar relationship between the signature S_F of the function F and input signatures
S_1, ..., S_m can be established. In [19] it was observed that a set of input signatures can
implement a target signature if and only if every pair of different bits in S_F is distinguished
by at least one of the input signatures S_1, ..., S_m.
Definition 8.3.4 A pair of bits to be distinguished (PBD) is an unordered pair of indices
{i, j} such that S_F(i) ≠ S_F(j).
Definition 8.3.5 A candidate signature S_m distinguishes a PBD in S_F if S_m(i) ≠ S_m(j),
where {i, j} ∈ $S^{PBD}_F$ and $S^{PBD}_F$ is F’s set of PBDs.
Example 3. Assume a target signal S_F = {0,0,1,1} and candidates S_1 = {0,0,0,1},
S_2 = {0,1,0,1}, and S_3 = {0,1,1,1}. The PBDs of S_F that need to be distinguished are
{0,2}, {0,3}, {1,2}, {1,3}. Note that S_1 and S_2 together cannot implement S_F because
they do not distinguish {0,2}. However, if all S_i are used, then all the bit pairs can be
distinguished and it is possible to construct a function that generates S_F from the S_i. In
this example, $S_F = S_3 \cdot (S_1 \,\bar{\oplus}\, S_2)$. □
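Definitions 8.3.4 and 8.3.5 and the values of Example 3 can be reproduced with a short sketch. Signatures are kept as plain lists of bits for readability; the function names are illustrative.

```python
from itertools import combinations


def pbds(target):
    """All unordered index pairs {i, j} with target[i] != target[j] (Definition 8.3.4)."""
    return {(i, j) for i, j in combinations(range(len(target)), 2)
            if target[i] != target[j]}


def distinguished(candidate, pairs):
    """The subset of `pairs` that the candidate signature distinguishes (Definition 8.3.5)."""
    return {(i, j) for i, j in pairs if candidate[i] != candidate[j]}


def can_implement(target, candidates):
    """True if every PBD of the target is distinguished by at least one candidate."""
    covered = set()
    for candidate in candidates:
        covered |= distinguished(candidate, pbds(target))
    return covered == pbds(target)


# Signatures from Example 3.
S_F = [0, 0, 1, 1]
S_1 = [0, 0, 0, 1]
S_2 = [0, 1, 0, 1]
S_3 = [0, 1, 1, 1]

print(sorted(pbds(S_F)))
print(can_implement(S_F, [S_1, S_2]), can_implement(S_F, [S_1, S_2, S_3]))
```

The last line of Example 3 can be checked bit by bit in the same way: for each index, S_3 AND (S_1 XNOR S_2) equals the corresponding bit of S_F.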
Essential PBDs. Input signatures form an irredundant cover of S_F’s PBDs when 1)
every PBD is covered by at least one S_i and 2) removing any one S_i results in at least one
uncovered PBD. The resulting S_i form the support of the function to be resynthesized.
Definition 8.3.6 A PBD that is distinguished by only one S_i is an essential PBD for S_i.
According to the definition of an irredundant cover and PBDs, each S_i must have at least
one essential PBD (or else that input can be discarded). Because there is at least one
essential PBD for each input, S_F is dependent on S_i, independently of its implementation,
if the following condition holds:

$$\exists i \;\; S_F(S_i = 0) \oplus S_F(S_i = 1) = 1 \qquad (8.3)$$

In the case of function F(a_1, ..., a_m) resynthesis, we note that the cardinality of the
irredundant cover can be less than m, because F may be independent of an input a_i up to
don’t-cares and the signature abstraction might not expose a sufficient number of essential
PBDs. Furthermore, several irredundant covers are possible. In this work, we greedily
determine irredundant covers by first selecting signatures that cover the most PBDs and
continuing until all PBDs are covered.
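The greedy selection and the notion of essential PBDs can be sketched as follows. The tie-breaking rule (first candidate in dictionary order) and the function names are assumptions made for this illustration; the text only states that signatures covering the most PBDs are selected first.

```python
from itertools import combinations


def pbds(target):
    """Unordered index pairs whose bits differ in the target signature."""
    return {(i, j) for i, j in combinations(range(len(target)), 2)
            if target[i] != target[j]}


def distinguished(candidate, pairs):
    """Pairs in `pairs` on which the candidate signature takes different values."""
    return {(i, j) for i, j in pairs if candidate[i] != candidate[j]}


def greedy_cover(target, candidates):
    """Repeatedly pick the signature covering the most uncovered PBDs.
    Returns the chosen names, or None if some PBD cannot be covered."""
    remaining = pbds(target)
    chosen = []
    while remaining:
        best = max(candidates,
                   key=lambda name: len(distinguished(candidates[name], remaining)),
                   default=None)
        gain = distinguished(candidates[best], remaining) if best is not None else set()
        if not gain:
            return None
        chosen.append(best)
        remaining -= gain
    return chosen


def essential_pbds(target, candidates, cover):
    """For each signature in the cover, the PBDs that only it distinguishes."""
    result = {name: set() for name in cover}
    for pair in pbds(target):
        owners = [name for name in cover if distinguished(candidates[name], {pair})]
        if len(owners) == 1:
            result[owners[0]].add(pair)
    return result


# Signatures from Example 3.
TARGET = [0, 0, 1, 1]
CANDIDATES = {"S1": [0, 0, 0, 1], "S2": [0, 1, 0, 1], "S3": [0, 1, 1, 1]}

COVER = greedy_cover(TARGET, CANDIDATES)
print(COVER, essential_pbds(TARGET, CANDIDATES, COVER))
```

A purely greedy pass can in general leave a member with no essential PBD; a follow-up pass that drops such members would be one way to restore irredundancy, although that step is not described in the text.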
Determining logic feasibility with essential PBDs. We now describe how the logic
feasibility of a given topology can be determined simply using signatures. In the next
chapter, we study how to create such topologies and how to verify the corresponding
signature-based abstraction. Our strategy assumes that the target library consists of all
two-input logic gates, so that each node n has exactly two input edges (although the initial
subcircuit can be mapped into any cells). In general, we do not restrict our topologies to
be fanout-free trees (a topology is fanout-free if each node n in T_F has only one outgoing
edge).
Note, however, that fanout-free topologies form a critical aspect of our goal-driven
synthesis strategy because, under two assumptions, they produce circuits with optimal area
and timing if such a fanout-free circuit exists. First, we assume that each gate in the
library requires the same area. Second, we assume that the delay through the subcircuit
is solely determined by its path length, that is, we assume that each wire is optimally
buffered. With these assumptions, fanout-free topologies have smaller area than their
non-fanout-free counterparts when implementing a single-output function because they have
fewer internal nodes (m−1 nodes). Furthermore, fanout-free topologies have the same or
smaller delay than non-fanout-free topologies. The proof of this is straightforward because if a
non-fanout-free topology has optimal delay based on path length, converting this topology
to a fanout-free tree by removing edges and nodes does not increase path length.
In the next few paragraphs, we introduce an algorithm for determining logical feasibility
on fanout-free circuits where each primary input has only one outgoing edge F, which
can be uniquely defined by an O($|S^{PBD}_F|$ · m)-time algorithm using signatures. Because
logic feasibility is not always possible for a fanout-free tree that optimizes a particular
performance criterion, we extend our synthesis techniques to handle arbitrary non-tree
topologies.
First, we associate a signature S_i to each input of T_F. If we assume that each S_i under
simulation distinguishes at least one essential PBD, we note the following for each
two-input gate in a fanout-free topology:
Theorem 5 Given input signatures S_1, S_2, and the two-input function Φ, the signature
S_{1,2} = Φ(S_1, S_2) has all the essential PBDs of S_1 and S_2.
Proof. Any cut through T_F gives a set of inputs that implements F. Therefore, the PBDs
of S_F must be distinguished by each cut in ckt(T_F) for any feasible topology. Since in a
fanout-free tree, S_1 and S_2 do not reoccur in the topology, the output of the node combining
S_1 and S_2, S_{1,2}, must contain their essential PBDs to distinguish S_F. □
As a direct consequence, each two-input transformation preserves at least two essential
PBDs. Furthermore, PBDs that only occur in both S_1 and S_2 must also be preserved to
uphold the invariant that every cut through the topology forms an input support. In a
similar manner, the work in [77] upholds this invariant in constructing a subcircuit but
considers SPFDs (i.e., sets of pairs of functions to be distinguished) instead. We note the
following:
Theorem 6 Given two input signatures where each one has at least one essential PBD,
there are at most two two-input Boolean functions (ignoring negated versions of these
functions) that can preserve all the essential PBDs.
Proof. A two-input Boolean function has a 4-row truth table with output 0 or 1. One
essential PBD adds the following constraint:

    [Φ(a,b) = z] ∧ [Φ(a′,b) = z′]    (8.4)

where a, b, and z are variables with value 0 or 1. In other words, two distinct rows of
the truth table must have different values. For a given a and b where an essential PBD is
defined, there are only 2 assignments to z that satisfy this constraint. The remaining
2 rows in the truth table can have any of 4 possible output combinations. Therefore, there
are a total of 8 different functions that satisfy this constraint. We ignore negated versions
of the Boolean function since that negation can be propagated to the inputs of later gates.
Given this, there are 4 distinct functions that can preserve one essential PBD. However,
since two essential PBDs must be preserved, the following constraint needs to be satisfied:

    [Φ(a,b) = z] ∧ [Φ(a′,b) = z′] ∧ [Φ(d,e) = y] ∧ [Φ(d,e′) = y′]    (8.5)

If {(a,b),(a′,b)} is disjoint from {(d,e),(d,e′)}, there are only 4 possible output
combinations of z and y that satisfy the constraints, where 2 of them are the negated form.
This is also the case if {(a,b),(a′,b)} is not disjoint from {(d,e),(d,e′)} (it is impossible
for two different functions to have essential PBDs on the same two rows). Therefore, there
are at most 2 distinct Boolean functions that can preserve the essential PBDs of their
inputs. □
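The counting argument in this proof can be checked by brute force. The sketch below enumerates all 16 two-input functions, keeps those satisfying the constraints of Equations 8.4 and 8.5, and groups each function with its complement. The row encoding and helper names are assumptions made for the illustration.

```python
from itertools import product

ROWS = [(0, 0), (0, 1), (1, 0), (1, 1)]          # truth-table rows (x, y)

def all_functions():
    """Yield every two-input Boolean function as a dict: row -> output bit."""
    for outputs in product((0, 1), repeat=4):
        yield dict(zip(ROWS, outputs))

def up_to_negation(functions):
    """Identify each function with its complement; return canonical truth tables."""
    classes = set()
    for f in functions:
        table = tuple(f[r] for r in ROWS)
        negated = tuple(1 - v for v in table)
        classes.add(min(table, negated))
    return classes

# One essential PBD (Equation 8.4) with a = b = 0: rows (0,0) and (1,0) must differ.
one = [f for f in all_functions() if f[(0, 0)] != f[(1, 0)]]
# A second essential PBD (Equation 8.5) with d = e = 0: rows (0,0) and (0,1) must differ.
two = [f for f in one if f[(0, 0)] != f[(0, 1)]]

# Repeat the two-constraint count for every choice of a, b, d, e.
counts = set()
for a, b, d, e in product((0, 1), repeat=4):
    both = [f for f in all_functions()
            if f[(a, b)] != f[(1 - a, b)] and f[(d, e)] != f[(d, 1 - e)]]
    counts.add(len(up_to_negation(both)))

print(len(one), len(up_to_negation(one)), len(two), up_to_negation(two), counts)
```

For a = b = d = e = 0 the two surviving classes are XOR and OR (with XNOR and NOR as their negated forms), which matches the bound of two candidate gates per node.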
If the fanout-free tree is traversed in topological order, a choice between two different
two-input gates is available for each node. In the worst case, all possible combinations
must be evaluated to preserve all the essential PBDs, generating an O(|S_F^PBD| 2^m)-time
complexity (there are m−1 nodes) for the final algorithm. For the typically small
topologies that are considered for resynthesizing portions of the critical paths, this result
is a significant practical runtime improvement over trying all possible gate combinations
without considering PBDs. Moreover, we note that in many cases the runtime complexity is
linear.
Theorem 7 Consider the following assumptions.
1. T_F is an m-input fanout-free tree.
2. The m-input function F is completely specified by S_F under simulation.
Under these conditions, the logical feasibility of T_F can be determined in O(|S_F^PBD| ∗ m)
time in the worst case.
Proof. A fanout-free topology specifies a disjoint partition of the inputs. If an
implementation exists with a disjoint partitioning of inputs, each internal node corresponds
to a function that is specified independently of the rest of the implementation. Therefore,
when the signatures completely specify F (a complete truth table), each internal node is
also completely specified. Because of this, each two-input operation must preserve at
least 3 essential PBDs (the minimal number of distinguishing bits a two-input function
can have) and therefore only one function satisfies this relation. Because there is only one
such candidate function, the complexity of finding an implementation is O(|S_F^PBD| ∗ m). □
Although we often resynthesize functions with small supports and therefore small truth
tables, a logic signature does not always completely specify a function’s behavior, resulting
in a reduction in the number of bits that need to be distinguished. Also, the ability of
simulation to quickly identify circuit don’t-cares further reduces the number of bits that
need to be distinguished. By not having a completely specified function, we facilitate
multiple feasible implementations. Despite the advantages of this flexibility in determining
a feasible implementation, an internal two-input operation may only need to preserve 2
essential PBDs rather than 3, which can increase the runtime of finding an implementation.
However, in practice, this runtime penalty is minor because the topologies are typically
small. Also, in many cases logical feasibility can still be determined in O(|S_F^PBD| ∗ m) time
depending on which bits need to be distinguished.
Although we work with a functionally complete set of two-input gates, our approach is
capable of targeting any standard-cell library. This is done by allowing topologies where
each node can have more than two incoming edges. For a completely-specified fanout-free
tree, we still only require a linear traversal to discover whether a logically feasible
implementation exists. Alternatively, after first decomposing the implementation to two-input
gates (where this decomposition already improves the physical characteristics), further
improvement by applying technology mapping using larger cells may be possible.
In some cases the optimal topology with respect to a given performance goal is not
logically feasible. Furthermore, some very common functions such as the multiplexor
function cannot be implemented using a fanout-free topology. Therefore, a viable
technique must handle a broader family of topologies. We therefore describe how essential
PBDs can be used to guide synthesis for non-tree topologies where each operation
preserves at least one of its inputs’ essential PBDs. This facilitates reconvergence and the
implementation of useful functions including multiplexors, as shown below.
Theorem 8 Consider a logic circuit with the following conditions:
1. At least one input to each node in the circuit does not fan out to another node at the
same or greater logic level.³
2. The only implementations considered are those where the signatures along each cut
through the topology form an irredundant cover.⁴
Under these conditions, the logical feasibility of an n-node topology T_F can be
determined in O(|S_F^PBD| ∗ 3^m) time.
Proof. By traversing the graph in topological order, note that at least one essential PBD is
transferred to the output. Also, when the implementations considered are those where
the signatures along each cut of the topology form an irredundant cover, each signature
along the cut has at least one essential PBD. The constraints in Equation 8.4 suggest that
there are four distinct two-input functions that preserve one essential PBD. However, one
of these functions will correspond to the one-input identity function, i.e., a buffer (or
inverter in the negated case). Ignoring this case, there are three other distinct functions
that can be tried at each node, which requires no more than 3^m total gate combinations to
determine logic feasibility. □

³The logic level of a node is determined by the path from the node to the primary inputs
with the greatest number of edges.
⁴In general, a topology may have an implementation with redundant covers. However, we
focus on implementations that do not use this redundancy to improve the efficiency of our
approach.
Handling arbitrary topologies with no implementation constraints requires more
computation where 5^m gate combinations are examined. However, in practice, our approach
is faster than the naïve enumeration described at the beginning of the section because the
operations are performed on the signatures, not over the whole truth table. Also, essential
PBDs can still significantly prune the search space. Each cut must still cover all of the
PBDs. If an edge from an internal node or primary input does not appear past a certain logic
level in the topology, its signature’s essential PBDs must be preserved across that level.
8.4 Concluding Remarks
We introduced two techniques that can enable powerful synthesis optimizations using
global don’t-cares, which is critical for post-placement optimization where less design
flexibility exists. We first presented a node-merging strategy that can operate directly on
mapped netlists. Unlike the work in [95], our techniques pursue global ODCs, which
are successfully evaluated against logic synthesis transformations. By exploiting global
don’t-cares, we identify several node mergers even after extensive synthesis optimizations,
resulting in up to 23% area reduction. Furthermore, our techniques are not restricted to
mapped circuits and can be used directly on AIGs in sequential verification applications.
In this context, global ODC analysis becomes more important because of the greater depth
in unrolled circuits.
Finally, we introduced a novel, goal-driven synthesis strategy that quickly finds logic
implementations for arbitrary topologies. In the next chapter, we demonstrate the
effectiveness of this approach by targeting a critical path delay reduction optimization goal.




Path-based Physical Resynthesis using Functional
Simulation
In this chapter, we apply the scalable simulation-based framework developed through-
out this dissertation to the physical synthesis domain. In particular, we introduce (1) a
novel criterion, based on path monotonicity, that identifies those interconnects amenable
to optimization through logic restructuring and (2) a synthesis algorithm relying on logic
simulation and placement information to identify placed subcircuits that hold promise for
interconnect reduction. Experiments indicate that our techniques find optimization oppor-
tunities and improve interconnect delay by 11.7% on average, at less than 2% wirelength
and area overhead.
As mentioned in [10], many critical paths cannot be improved through cell relocation
and better timing-driven placement. Furthermore, the inaccuracy of timing estimates
before detailed placement limits the effectiveness of techniques from [40] in eliminating
path non-monotonicity. We target these non-monotone paths for resynthesis by generating
different logic topologies that improve circuit delay. We use the synthesis strategy
introduced in Chapter VIII to efficiently determine whether a logic implementation for the
desired topology is possible.
In the example of Figure 9.1, we suggest that by applying our technique, a subcircuit
with a long critical path can be transformed to a functionally-equivalent subcircuit with
smaller critical path delay. Unlike most techniques from logic synthesis, the circuit
restructuring can work directly on mapped circuits with complex standard cells. Compared
to work in [84], our approach exploits global don’t-cares to enhance logic restructuring. In
[53], redundancy addition and removal (RAR) are used to improve circuit timing. However,
these rewiring techniques consider only a subset of our transformations, where we
use redundancy and physical information in conjunction to directly guide the resynthesis
of subcircuits containing multiple cells.
Figure 9.1: The resynthesis of a non-monotone path can produce much shorter critical
paths and improve routability.
Our experiments indicate that large circuits often contain many long critical paths that
can be effectively targeted with restructuring. Improving these paths results in consistent
delay improvements, of 11.7% on average, with minimal degradation to other performance
parameters. Furthermore, we achieve almost twice the delay improvement of that achieved
by RAR-based timing optimizations. Our techniques are fast and scale to large designs,
whereas completely characterizing node functionality with BDDs would require a
prohibitive memory footprint.
In Section 9.1, we introduce our interconnect optimization strategy. In Section 9.2,
we propose a metric for finding circuit paths that require restructuring. Sections 9.3 and
9.4 integrate these innovations in a novel physically-aware synthesis approach that uses
simulation. Empirical evaluation is presented in Section 9.5.
Figure 9.2: Improving delay through logic restructuring. In our solution, we first identify
the most promising regions for improvements, and then we restructure them
to improve delay. Such netlist transformations include gate cloning, but are
also substantially more general. They do not require the transformed
subcircuits to be equivalent to the original one. Instead, they use simulation and
satisfiability to ensure that the entire circuit remains equivalent to the original.
9.1 Logic Restructuring for Timing Applications
We introduce a logic resynthesis approach that accounts for physical aspects of
performance optimization, by leveraging our simulation-based framework discussed in the
previous chapters. We illustrate the approach in Figure 9.2. Starting from a fully placed
circuit, we identify critical paths using static timing analysis. We then apply a novel metric,
introduced in Section 9.2, that selects subcircuits for which logic restructuring could
provide the greatest improvements. We restructure these subcircuits using bit signatures along
with physical constraints, to derive a topology that is logically equivalent to the original
one but exhibits better performance. Finally, we legalize the altered placement and
update the timing information. Besides timing improvements, this technique can target
other objectives as well, such as wirelength reduction. Using signatures for restructuring
applications is advantageous because signatures can characterize internal nodes for netlists
mapped to standard cells as well as for technology-independent netlists. In contrast, other
logic rewriting strategies, such as the one in [62], cannot operate on technology-mapped
circuits and do not take physical information into account.
9.2 Identifying Non-monotone Paths
To maximize the effectiveness of our post-placement optimizations, we target parts
of the design with critical timing constraints that are amenable to restructuring. In this
section, we introduce our fast dynamic programming (DP) algorithm for finding
non-monotone paths, i.e., paths that are not of minimal length. Unlike the work in [10] that
considers only paths with two wire segments, we consider paths of arbitrary lengths and
can scale to many more segments in practice. We propose two models for computing path
monotonicity: (1) wirelength-based and (2) delay-based. Non-monotonic paths indicate
regions where interconnect and/or delay may be reduced by post-placement optimization.
9.2.1 Path Monotonicity
input
    Dist: length of paths considered
output
    NMF: NMF between each node

gen_NMF(Nodes nodes, Dist K) {
    levelize(nodes);
    for each node1 ∈ nodes {
        for each node2 ∈ range(node1+1, node1+K)
            c_ideal[node1,node2] = ideal cost from node1 to node2;
        for each node2 ∈ range(node1_succ, node1_succ+K) {
            subtot[node1,node2] = max(subtot[node1,node2_pred]
                                      + c(node2_pred, node2));
            NMF[node1,node2] = subtot[node1,node2] / c_ideal[node1,node2];
        }
    }
}
Figure 9.3: The gen_NMF() algorithm for computing non-monotone factors.

Figure 9.4: Calculating the non-monotone factor for path {d,h}. The matrix shows
sub-computations that are performed while executing the algorithm in Figure 9.3.

First, static timing analysis is performed to enable our delay-based monotonicity
calculation and identify critical and near-critical paths. We use a timing analyzer whose
interconnect delay calculation is based on Steiner-tree topologies produced by FLUTE [23]
and the D2M delay metric [6] that is known to be more accurate than Elmore delay. Before
focusing on critical paths, we describe a general approach that examines the monotonicity
of every path. We define the non-monotone factor (NMF) for the path {x_1, ..., x_k} with

    NMF = \frac{1}{c_{ideal}(x_1, x_k)} \sum_{i=1}^{k-1} c(x_i, x_{i+1})    (9.1)

where c(a,b) defines the actual cost between a and b and c_ideal(a,b) defines an optimal
cost. When NMF = 1, the path is monotone under the cost metric. We explore two
definitions for cost, one based on rectilinear distance and the other on delay.
In the former case, c(a,b) is the rectilinear distance between cells a and b, while c_ideal(a,b)
is the optimal rectilinear distance assuming a monotonic path. For the delay-based
definition, c(a,b) is AT(b)−AT(a), where AT is arrival time. We define c_ideal as the
delay of an optimally buffered path between a and b as described by [67] and given by the
following formula:

    c_{ideal}(a,b) = dist(a,b) (R_{buf} C + R C_{buf} + \sqrt{2 R_{buf} C_{buf} R C})    (9.2)

where R and C are the wire resistance and capacitance, respectively, and R_buf and C_buf
are the intrinsic resistance and input capacitance of the buffers. dist(a,b) is the rectilinear
distance between a and b. Unlike the distance calculation where the ideal path length
between a and b can be equal to the actual path length, the optimal buffered wire between
a and b has delay ≤ AT(b)−AT(a). We only attempt to optimize paths with large
non-monotone factors.
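As a worked illustration of the two cost models, the Python sketch below computes the wirelength-based NMF of a placed path and evaluates the buffered-delay expression of Equation 9.2. The cell coordinates and electrical values are hypothetical, and the function names are assumptions introduced for this example.

```python
import math

def manhattan(p, q):
    """Rectilinear distance between two placed cells given as (x, y) pairs."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def nmf_wirelength(path):
    """Total hop-by-hop distance divided by the endpoint-to-endpoint distance.

    Assumes the two endpoints are at different locations (nonzero ideal cost).
    """
    actual = sum(manhattan(p, q) for p, q in zip(path, path[1:]))
    ideal = manhattan(path[0], path[-1])
    return actual / ideal

def ideal_buffered_delay(dist, r_wire, c_wire, r_buf, c_buf):
    """Delay of an optimally buffered wire of length `dist` (Equation 9.2)."""
    return dist * (r_buf * c_wire + r_wire * c_buf
                   + math.sqrt(2 * r_buf * c_buf * r_wire * c_wire))

detour = [(0, 0), (3, 0), (3, 4), (1, 4)]     # path that doubles back in x
direct = [(0, 0), (2, 0), (2, 3)]             # monotone path

nmf_detour = nmf_wirelength(detour)           # (3 + 4 + 2) / 5
nmf_direct = nmf_wirelength(direct)           # (2 + 3) / 5
unit_delay = ideal_buffered_delay(2.0, 1.0, 1.0, 1.0, 1.0)
print(nmf_detour, nmf_direct, unit_delay)
```

For the delay-based metric, the numerator would instead sum AT(b)−AT(a) over consecutive cells and the denominator would use `ideal_buffered_delay` for the endpoints.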
9.2.2 Calculating Non-monotone Factors
We now present our algorithm for calculating the NMF of all k-hop paths in a circuit,
for a given k ≥ 2. Our experiments reveal the existence of high NMFs on even relatively
short paths, which is advantageous since optimizations on these smaller paths often mean
fewer perturbations to the existing placement while significant performance benefits are
achieved.
Figure 9.5: Our flow for restructuring non-monotone interconnect. We extract a
subcircuit selected by our non-monotone metric and search for alternative equivalent
topologies using simulation. The new implementations are then considered
based on the improvement they bring and verified to be equivalent with an
incremental SAT solver.
Figure 9.6: Extracting a subcircuit for resynthesis from a non-monotone path.

The non-monotone factor can be efficiently computed for every path using an O(nk)-time
algorithm for n nodes in the circuit, as shown in Figure 9.3. First, the circuit is
levelized. Then, c_ideal is computed for node pairings with a connecting path of ≤ k hops,
and the values are stored in the c_ideal array. All pairs are traversed again, and
subtot is generated by computing the maximum cost from node1 to node2 through
a recurrence relation. The NMF is computed for the subpath, {node1, node2}, by
dividing the total cost, subtot, by c_ideal[node1,node2]. In Figure 9.4, we show
an example computation on a subcircuit being traversed using the gen_NMF() function
where k = 3 and the current node1 is d. The matrix indicates the NMFs already computed
with #, and nodes not lying on the same path with X. Because we traverse the graph in
levelized order, a, b, c have already been examined. Notice that nodes that are farther
than k hops away are not examined (indicated by K in the matrix). For node d, the
non-monotone factor is computed for path {d,h} by determining all the incoming sub-paths
to h first. In this example, {d,h} has the highest NMF if rectilinear distance is the cost
function.
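The recurrence can be sketched in Python as follows. This is a simplified variant written for illustration: instead of the levelized traversal of Figure 9.3, it expands each source node hop by hop and keeps the maximum accumulated cost per destination, using rectilinear distance as the cost. The graph encoding, the function name `gen_nmf`, and the example placement are assumptions.

```python
def gen_nmf(succ, pos, k):
    """Return {(src, dst): NMF} for node pairs joined by a path of at most k hops.

    succ maps each node to its fanout nodes in a DAG; pos maps nodes to (x, y).
    """
    def dist(u, v):
        return abs(pos[u][0] - pos[v][0]) + abs(pos[u][1] - pos[v][1])

    nmf = {}
    for src in succ:
        frontier = {src: 0}        # max cost over paths with exactly h hops
        subtot = {}                # max cost over all paths with 1..k hops
        for _ in range(k):
            nxt = {}
            for u, cost in frontier.items():
                for v in succ[u]:
                    cand = cost + dist(u, v)
                    if cand > nxt.get(v, -1):
                        nxt[v] = cand
            for v, cost in nxt.items():
                subtot[v] = max(subtot.get(v, 0), cost)
            frontier = nxt
        for dst, total in subtot.items():
            ideal = dist(src, dst)
            if ideal > 0:          # skip co-located endpoints
                nmf[(src, dst)] = total / ideal
    return nmf

# Hypothetical four-cell subcircuit: d reaches h through e (a detour) or f.
succ = {"d": ["e", "f"], "e": ["h"], "f": ["h"], "h": []}
pos = {"d": (0, 0), "e": (4, 0), "f": (0, 1), "h": (1, 1)}
result = gen_nmf(succ, pos, k=2)
print(result)
```

Here the pair (d, h) receives an NMF of 4 because the longest two-hop route through e covers 8 units while the endpoints are only 2 units apart; every single-hop pair is monotone.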
Figure 9.7: Signatures and topology constraints guide logic restructuring to improve
critical path delay. The figure shows the signatures for the inputs and output of the
topology to be derived. Each table represents the PBDs of the output F that
are distinguished. The topology that connects a and b directly with a gate is
infeasible because it does not preserve essential PBDs of a and b. A feasible
topology uses b and c, followed by a.
9.3 Physically-aware Logic Restructuring
We optimize the subcircuits that are identified by the path monotonicity metric as
illustrated in Figure 9.5. We first select a region of logic determined by the non-monotone
path for resynthesis. We then use signatures to find an alternative implementation with a
topology that improves physical parameters and that is logically equivalent to the
original implementation (up to the signatures). This implementation is then formally verified
by performing SAT-based equivalence checking between the original and new netlists.
Previous work on improving path monotonicity used logic replication [42]. However,
the technique is restricted to the topology of the extracted subcircuit, and its
optimization is independent of the subcircuit’s functionality. Furthermore, as observed in [42],
cell relocation sometimes cannot improve path monotonicity. In the previous chapter, we
introduced the theoretical framework to resynthesize a subcircuit given a set of inputs and
a target output by using our algorithm for determining logic feasibility. We now
introduce an algorithm for constructing subcircuits using signatures and physical constraints to
optimize the interconnect.
9.3.1 Subcircuit Extraction
After identifying the path that is least monotone, we extract a subcircuit (as shown in
Figure 9.6) with incoming path edges as inputs and outgoing edges as outputs. The inputs
and fanout of the subcircuit are treated as fixed cells, forming the physical constraints. As
shown in the figure, if there are outgoing edges at intermediate nodes in the path, this logic
is duplicated. In practice, we experience minimal cell area increase because the number of
duplicated cells is small, and the resynthesized circuit is smaller than the original in many
cases.
9.3.2 Physically-guided Topology Construction
In addition to efficiently determining the logic feasibility of various topologies, we
propose an algorithm that uses PBDs and physical constraints to efficiently construct
logically feasible topologies. In this chapter, we guide our approach using delay and physical
proximity. In the example shown in Figure 9.7, we try to find an optimal restructuring to
implement the target function F with inputs a, b, and c. The functionality of the original
circuit is represented by signatures. The figure also shows a table associated with each
signal showing the PBDs that are distinguished. The non-essential PBDs for each input
signature have a light-gray background.
The example shows that the arrival time for c is the greatest, followed by a, then b.
Therefore, we should consider alternative topologies where c’s value is required later. We
also consider the proximity of the signals and therefore examine topologies where a direct
operation between a and b is performed. Notice that if all possible two-input operations
are tried, the essential PBDs are not preserved and hence these are not feasible topologies.
We then consider another topology where a can be accessed later and thus it generates
an operation connecting c and b first. For this topology, we observe that an XOR gate
preserves the essential PBDs. We then can easily derive that an OR gate is needed to
implement F.
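The reasoning in this example can be replayed on a small stand-in function. The sketch below assumes a hypothetical target F = a OR (b XOR c), simulated exhaustively, since the signatures of Figure 9.7 are not reproduced here. For a chosen wire pair it tries all 16 two-input gates and keeps those whose output, together with the remaining input, still distinguishes every PBD of F (the cut invariant from Chapter VIII).

```python
from itertools import combinations, product

VECTORS = list(product((0, 1), repeat=3))                 # all (a, b, c) values
SIG = {name: [v[i] for v in VECTORS] for i, name in enumerate("abc")}
F = [a | (b ^ c) for a, b, c in VECTORS]                  # hypothetical target

def cut_covers(cut, target):
    """True if every PBD of `target` is distinguished by some signature in `cut`."""
    return all(any(s[i] != s[j] for s in cut)
               for i, j in combinations(range(len(target)), 2)
               if target[i] != target[j])

def feasible_gates(x, y, rest):
    """Truth tables (f00, f01, f10, f11) for combining wires x and y first."""
    found = []
    for table in product((0, 1), repeat=4):
        out = [table[2 * p + q] for p, q in zip(SIG[x], SIG[y])]
        if cut_covers([out, SIG[rest]], F):
            found.append(table)
    return found

ab_first = feasible_gates("a", "b", "c")      # combine a and b, use c later
bc_first = feasible_gates("b", "c", "a")      # combine b and c, use a later

# With XOR on (b, c), an OR with a completes the implementation.
xor_bc = [b ^ c for _, b, c in VECTORS]
final = [a | x for (a, _, _), x in zip(VECTORS, xor_bc)]
print(ab_first, bc_first, final == F)
```

For this stand-in function no gate on (a, b) keeps the cut a valid support, whereas XOR and its negation work on (b, c), mirroring the outcome described above.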
Algorithm. Figure 9.8 introduces the pseudo-code of the restructuring algorithm
for non-monotone interconnect. After identifying the non-monotone paths,
Optimize_Interconnect() restructures a portion of the critical path. Before restructuring the
path, we first simplify the signatures with simplify_signatures() by noting that
the size of the signature |S_F| can be reduced to the number of different input combinations
that occur across {S_1, ..., S_i}. Thus, only a subset of the signature is needed for
restructuring because the small subcircuits considered have a maximum of 2^i possible different
input combinations, smaller than the number of simulation vectors applied.²
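One way to picture the signature simplification is the Python sketch below, which keeps a single simulation position for each distinct combination of input bits. It is an illustrative reading of the step rather than the code behind simplify_signatures(); the list-of-bits encoding and the example data are assumptions, and don't-care positions are not modeled.

```python
def simplify_signatures(inputs, target):
    """Keep one simulation position per distinct input combination.

    inputs is a list of equal-length bit lists (one per subcircuit input);
    target is the output signature over the same simulation vectors.
    """
    first_seen = {}
    for position, combo in enumerate(zip(*inputs)):
        first_seen.setdefault(combo, position)    # remember first occurrence only
    keep = sorted(first_seen.values())
    reduced_inputs = [[sig[p] for p in keep] for sig in inputs]
    reduced_target = [target[p] for p in keep]
    return reduced_inputs, reduced_target

# Six simulation vectors, but only four distinct (S1, S2) combinations occur.
S1 = [0, 0, 1, 1, 0, 1]
S2 = [0, 1, 0, 1, 0, 1]
SF = [0, 1, 1, 0, 0, 0]
small_inputs, small_target = simplify_signatures([S1, S2], SF)
print(small_inputs, small_target)
```

With i inputs the reduced signatures have at most 2^i positions regardless of how many vectors were simulated.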
Optimize_circuit() {
    gen_NMF(); num_tries = X;
    while (worst_nmf > 1) {
        if (nckt == Optimize_Interconnect(worst_nmf)) {
            ...
    while (find_opt_topology(constrs)) {
        ...
Figure 9.8: Restructuring non-monotone interconnect.
²In our experiments, we apply 2048 input vectors and restructure subcircuits with < 10 inputs.

In find_opt_topology(), we find a topology that optimizes delay for the given
physical constraints, such as the physical locations of the subcircuit’s inputs and outputs.
The topology is created by a greedy algorithm which derives a fanout-free topology from
the current input wires. We examine each pair of wires, apply an arbitrary cell, and
estimate the delay to the output of the subcircuit. The topology is then greedily constructed so
that wire pairs with earlier arrival times are favored in the early computation stages of the
topology. From this initial topology, we can obtain an upper-bound for the best possible
implementation. If a topology can’t be found that satisfies the constraints, the function
returns.
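The greedy preference for early-arriving wires can be sketched as follows. This is a deliberately reduced model, not find_opt_topology() itself: it ignores cell locations and wire delay, assumes a uniform hypothetical gate delay, and simply merges the two earliest wires at every step.

```python
import heapq
from itertools import count

def greedy_topology(arrival, gate_delay=1.0):
    """Build a fanout-free tree by repeatedly pairing the two earliest wires.

    arrival maps wire names to arrival times (assumed non-empty).
    Returns (estimated output arrival time, nested-tuple tree).
    """
    tie = count()                                   # tie-breaker for equal times
    heap = [(t, next(tie), name) for name, t in arrival.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        t1, _, x = heapq.heappop(heap)
        t2, _, y = heapq.heappop(heap)
        # The new gate's output is ready one gate delay after its later input.
        heapq.heappush(heap, (max(t1, t2) + gate_delay, next(tie), (x, y)))
    delay, _, tree = heap[0]
    return delay, tree

# Arrival order as in the Figure 9.7 discussion: c latest, then a, then b.
delay, tree = greedy_topology({"a": 2.0, "b": 1.0, "c": 5.0})
print(delay, tree)
```

The resulting tree combines b and a first and brings in c last, which is the first topology the text examines before feasibility rules it out.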
The topology that is derived is then checked for logical feasibility using PBDs
and signatures in check_logical_feasibility(). If the topology is feasible, we
associate the appropriate gate with each vertex and place the subcircuit. Our placement
routine considers only the legality of the subcircuit (we call a placement legalizer later for
the entire design). In our approach, we determine a location for each gate by placing it at
the center of gravity of its inputs and outputs and then sifting the gate to different nearby
locations. This sifting is done over all the gates and over several passes until a locally
optimal solution is achieved, resulting in no overlaps. For the typically small subcircuits
considered, this requires a small computational effort.
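A single placement step of this kind might look like the sketch below. It is an assumption-laden simplification: sites are integer grid points, the quality measure is the total rectilinear distance to the connected pins, and only one gate and one sifting pass are shown.

```python
def place_gate(pins, occupied, radius=1):
    """Place a gate near the center of gravity of its pins on a free grid site.

    pins is a list of (x, y) locations of the gate's inputs and outputs;
    occupied is a set of grid sites that are already taken.
    """
    cx = round(sum(x for x, _ in pins) / len(pins))
    cy = round(sum(y for _, y in pins) / len(pins))

    def wirelength(site):
        return sum(abs(site[0] - x) + abs(site[1] - y) for x, y in pins)

    # Sift over nearby free sites and keep the one with the least wirelength.
    candidates = [(cx + dx, cy + dy)
                  for dx in range(-radius, radius + 1)
                  for dy in range(-radius, radius + 1)
                  if (cx + dx, cy + dy) not in occupied]
    return min(candidates, key=wirelength) if candidates else None

# The center of gravity (2, 1) is taken, so a neighboring site is selected.
site = place_gate([(0, 0), (4, 0), (2, 3)], occupied={(2, 1)})
print(site)
```

Repeating this step over all gates for several passes gives a simple form of the sifting loop described above.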
Finally, if the topology is not logically feasible, we add a functional constraint that
prevents the construction of similar topologies. The constraint states which wire pairs
should not be combined again. For instance, for the multiplexor, z = a′b + ac, there is
no implementation with a fanout-free topology with inputs {a,b,c}. If a and b form a
wire pair, no implementation can preserve its essential PBDs. However, we can exploit
Theorem 8 and consider implementations that eliminate one of the inputs. In this case, if
the implementation a′b is attempted, the wire b does not need to reappear in the topology.
Therefore, a constraint is added so that the inputs to the topology are now {a′b,a,c}. With
these inputs, a fanout-free tree does exist which is logically feasible.
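The multiplexor claim can be confirmed by exhaustive search over all three-leaf fanout-free trees. The sketch below uses plain enumeration over complete truth tables rather than the PBD-based procedure, purely as a check of the example; the helper names are assumptions.

```python
from itertools import product

VECTORS = list(product((0, 1), repeat=3))                  # all (a, b, c) values
A, B, C = ([v[i] for v in VECTORS] for i in range(3))
Z = [((1 - a) & b) | (a & c) for a, b, c in VECTORS]       # z = a'b + ac
GATES = list(product((0, 1), repeat=4))                    # (f00, f01, f10, f11)

def apply_gate(table, x, y):
    """Evaluate a two-input gate, given as a truth table, on two signatures."""
    return [table[2 * p + q] for p, q in zip(x, y)]

def fanout_free_feasible(s1, s2, s3, target):
    """True if g(h(x, y), w) equals target for some pairing and gate choice."""
    for x, y, w in ((s1, s2, s3), (s1, s3, s2), (s2, s3, s1)):
        for h in GATES:
            inner = apply_gate(h, x, y)
            if any(apply_gate(g, inner, w) == target for g in GATES):
                return True
    return False

plain = fanout_free_feasible(A, B, C, Z)                   # inputs {a, b, c}
T = [(1 - a) & b for a, b in zip(A, B)]                    # new wire a'b
reduced = fanout_free_feasible(T, A, C, Z)                 # inputs {a'b, a, c}
print(plain, reduced)
```

The second search succeeds with an AND of a and c followed by an OR with the wire a′b, which is the reconvergent structure permitted by Theorem 8.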
If Optimize_Interconnect() returns a subcircuit, we check the equivalence of
the entire circuit using a SAT engine. In the case where our candidate produces a
functionally different circuit (which is rare, as shown in Section 9.5), we use the counterexample
generated by SAT to refine our simulation, hence improving the signatures’ quality. If the
resulting subcircuit passes verification, we update the netlist and legalize the placement.
We update the timing information and the NMFs if a new critical path is found, in which
case we select the path with the next highest NMF and restructure it.
9.4 Enhancing Resynthesis through Global Signature Matching
Our resynthesis strategy considers the inputs to a non-monotone path for resynthesis.
This strategy is convenient because 1) the set of inputs can always implement the target
output and 2) the inputs tend to be physically close to the target output. However, local
manipulations can be enhanced by incorporating global information. In this section, we
explain how to exploit the same advantages of structural hashing for area reductions, by
applying matching to the signature abstraction. Furthermore, our approach is more
powerful than logic rewriting because the signatures are matched while considering global
don’t-cares, and our initial physically-guided local rewriting over signatures already
exploits don’t-cares.
Strategy. To resynthesize non-monotone paths, we exploit signature matching in the
following way:
1. Find a set of candidate wires within a certain distance from the output wire to be
resynthesized.
2. Check whether any candidate’s signature is equal to the output signature up to don’t-
cares, as discussed in Chapter VIII. If a match is found and the timing can be
improved, replace the output wire with the corresponding candidate wire.
3. While checking logic feasibility in topological order, check whether any of the
internal wires can be reimplemented using a candidate wire with a matching signature
to further improve timing.
The candidate wires are chosen by proximity to the output wire being resynthesized as
determined by its half-perimeter wirelength (HPWL). Any wire annotated with an arrival
time after the current output wire’s annotated arrival time is not considered. Unlike the
resynthesis algorithm that uses a simplified signature, for signature matching, we consider
the whole signature except for the don’t-cares. In this case, a single comparison between
signatures can be performed quickly and it is more efficient than finding a common set of
inputs to both wires and then reducing the signatures to the number of simulated different
input combinations. Notice that our algorithm is used to enhance the previous resynthesis
strategy and improve the timing of a specific implementation, while in general topology
construction only the inputs to the subcircuit are considered.
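The comparison in step 2 of the strategy reduces to a masked bitwise test. The sketch below stores signatures as Python integers with a care mask whose zero bits mark don't-care positions. The candidate tuple layout, the function names, and the sample values are assumptions for illustration, and a match found this way would still need SAT-based verification.

```python
def matches(sig, candidate, care_mask):
    """True if the signatures agree on every position that is not a don't-care."""
    return ((sig ^ candidate) & care_mask) == 0

def best_replacement(out_sig, care_mask, out_arrival, candidates):
    """Pick the earliest-arriving candidate wire whose signature matches.

    candidates is an iterable of (name, signature, arrival_time) tuples,
    assumed to be pre-filtered by physical proximity to the output wire.
    """
    usable = [(arrival, name) for name, sig, arrival in candidates
              if arrival < out_arrival and matches(out_sig, sig, care_mask)]
    return min(usable)[1] if usable else None

out_sig = 0b10110010
care = 0b11101110                  # bits 4 and 0 are don't-cares
wires = [
    ("w1", 0b10100011, 4.0),       # differs only in don't-care positions
    ("w2", 0b10110010, 9.0),       # identical, but arrives too late
    ("w3", 0b00110010, 3.0),       # differs in a cared-for position
]
choice = best_replacement(out_sig, care, 7.0, wires)
print(choice)
```

Because the test is one XOR and one AND per machine word, scanning many nearby wires is inexpensive compared with re-deriving a common input support.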
9.5 Empirical Validation
We implemented and tested our algorithms with circuits from the IWLS 2005 benchmark
suite [102], with design utilization set to 70% to match recent practices in the
industry. Our wire and gate characterizations are based on a 0.18µm technology library.
We perform static timing analysis using the D2M delay metric [6] on Rectilinear Steiner
Minimal Trees (RSMTs) produced by FLUTE [23]; here FLUTE can be easily replaced
by any timing-driven subroutine, without significantly affecting the overall trends of our
experiments. Our netlist transformations are verified using a modified version of MiniSAT
[29] and placed using Capo 10 [16]. We have considered several different initial
placements for each circuit by varying a random seed in Capo and report results as average
improvements over these placements. Our netlist transformations are legalized using the
legalizer provided in the GSRC Bookshelf [105].
To evaluate delay improvements, we apply the algorithm of Figure 9.8 to the
testbenches. We applied 2048 random simulation patterns initially to generate the signatures.
We considered paths of less than or equal to 4 hops (5 nodes) using our delay-based
metric, which allowed us to find many non-monotone paths while minimizing the size of the
transformations considered. We conducted several optimization passes until no more gains
were achieved.
9.5.1 Prevalence of Non-monotonic Interconnect
Figure 9.9: The graph plots the percentage of paths whose NMF is below the
corresponding value indicated on the x-axis. Notice that longer paths tend to be
non-monotone and at least 1% of paths are > 5 times the ideal minimal length.
Our experiments indicate that circuits often contain many non-monotone paths. In
Figure 9.9, we illustrate a cumulative distribution of the percentage of paths whose NMF
is below the corresponding value on the x-axis. We generated these averages over all
the circuits in Table 9.1. Each line represents a different path length examined, where we
considered paths from 2 hops to 6 using the wirelength-based NMF metric. We also show
the cumulative distribution for the 4-hop delay-based NMF calculation used to guide our
delay-based restructuring. Of particular interest is the percentage of monotonic paths, i.e.,
paths with NMF = 1.
Notice that smaller paths of 2 hops are mostly monotone, whereas the percentage of
monotone paths decreases to 23% when considering 6-hop paths. This indicates that
focusing optimizations on small paths only, as in [10], can miss several optimization
opportunities. It is also interesting to note that there are paths with considerably worse
monotonicity having NMFs > 5, revealing regions where interconnect optimizations are needed.
We observe similar trends using our delay-based metric. The inclusion of gate delay on
these paths results in greater non-monotonicity when compared to the wirelength metric.
Although not shown, each individual circuit exhibits similar trends.
9.5.2 Physically-aware Restructuring
We show the effectiveness of our delay-based optimization by reporting the delay
improvements achieved over several circuits. In Table 9.1, we provide the number of cells
and nets for each circuit. In the performance columns, we give the percentage delay
improvement, the runtime in seconds, and the percentage of equivalence-checking calls
where candidate subcircuits preserved the functionality of the entire circuit. We also
report the overhead of our approach with the percentage of wirelength increase and the
percentage of cell count increase.

               cell     net          performance             overhead
circuit       count   count   %delay    time   %equiv     %wire   %cells
                              improv     (s)   checks
sasc            563     568     14.1      41      100      2.35     3.13
spi            3227    3277     10.9     949       82      4.53     0.73
desarea        4881    5122     12.3     503       93      1.09     0.31
tv80           7161    7179      9.1    1075       71      2.50     0.17
s35932         7273    7599     27.5     476      100      2.14     0.19
systemcaes     7959    8220     13.9     748       95      0.89    -0.07
s38417         8278    8309     11.7     481       84      0.68    -0.21
mem_ctrl      11440   11560      9.2     678       37      0.05    -0.02
ac97          11855   11948      6.3     245      100      0.44     0.02
usb           12808   12968     12.2     605       80      0.30     0.06
DMA           19118   19809     14.5     845       65      0.16     0.08
aes           20795   21055      6.4     603      100      0.13     0.01
ethernet      46771   46891      3.7     142      100      0.08     0.06
average                        11.7%            85.1%     1.20%    0.34%

Table 9.1: Significant delay improvement is achieved using our path-based logic
restructuring. Delay improvement is typically accompanied by only a small wirelength
increase.

Considering 8 independently generated initial placements for each circuit, our
techniques improve delay by 11.7% on average. For some circuits, such as s35932, several
don’t-care enhanced optimizations enabled even greater delay improvements.
Note that, by optimizing only one output of a given subcircuit, we greatly reduce the
arrival time of the critical output while only slightly degrading the performance of the
other outputs. Moreover, through our efficient use of don’t-cares, several m-input
subcircuits could be restructured to require fewer than m inputs. As a special case of the
previous point, sometimes an input to the subcircuit is functionally equivalent to the
output of the subcircuit when don’t-cares are considered, enabling delay reduction along
with removal of unnecessary logic. Signatures are efficient in exploiting these
opportunities. Finally, the decomposition of large gates into smaller gate primitives
through our restructuring algorithm often produces better topologies because we construct
a topology that meets the physical constraints more precisely.
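The special case above, in which a subcircuit input matches the subcircuit output once
don’t-cares are considered, can be illustrated with a minimal sketch. It assumes that a
signature is stored as an integer with one bit per simulation vector and that a care mask
holds a 1 wherever the output value matters. The function names and the example values
are hypothetical and are not taken from the implementation evaluated in this chapter.

```python
def equal_up_to_dont_cares(sig_a: int, sig_b: int, care_mask: int) -> bool:
    """True if two signatures agree on every simulation vector marked as a care bit."""
    return ((sig_a ^ sig_b) & care_mask) == 0


def find_bypass_inputs(input_sigs: dict, out_sig: int, care_mask: int) -> list:
    """Return the names of inputs whose signature matches the output on all care bits."""
    return [name for name, sig in input_sigs.items()
            if equal_up_to_dont_cares(sig, out_sig, care_mask)]


# Eight simulation vectors, one bit each (illustrative values only).
inputs = {"a": 0b10110010, "b": 0b01101100, "c": 0b11110000}
out_sig = 0b10110110      # simulated value of the subcircuit output
care_mask = 0b11111011    # bit 2 is assumed to be a don't-care for the output

candidates = find_bypass_inputs(inputs, out_sig, care_mask)
print(candidates)
```

A signature match only nominates a candidate, because the simulation vectors cover a
sample of the input space; the candidate would still need a formal equivalence check
before the rewiring is accepted.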
We also believe that further gains would be enabled by combining buffering, relocation,
and gate sizing strategies in our restructuring optimizations. The wirelength and
cell-count overhead are minimal because only a few restructurings are needed and the
optimizations can simplify portions of logic. In some cases the number of cells is reduced.
The runtime of our algorithm scales well for large circuits due to the use of logic
simulation as the main optimization engine. The high percentage of equivalence-checking
calls that confirmed the equivalence of our transformations indicates that signatures
are effective at finding functionally equivalent candidates. Furthermore, we observe
that SAT-based equivalence checking requires a small fraction of the total runtime
compared to constructing optimal topologies, even for our larger circuit examples. This
small runtime can be attributed to the locality of most structural transformations.
Because the structures of the original and modified circuits are similar, the SAT
instance can be greatly reduced in size and complexity. This limits the complexity of our
approach, which tends not to grow with the size of the overall circuit.
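The division of labor described above, in which inexpensive signatures propose candidates
and a formal check confirms or refutes them, can be sketched as follows. Exhaustive
enumeration over a three-input window stands in for the SAT-based miter check purely to
keep the example self-contained. The three small functions and the simulation vectors are
hypothetical.

```python
import itertools


def signature(func, vectors):
    """Pack the outputs of func over the given simulation vectors into an integer."""
    sig = 0
    for i, vec in enumerate(vectors):
        sig |= int(bool(func(*vec))) << i
    return sig


def formally_equivalent(f, g, num_inputs):
    """Stand-in for a SAT miter: f XOR g must be 0 for every input assignment."""
    return all(bool(f(*v)) == bool(g(*v))
               for v in itertools.product((0, 1), repeat=num_inputs))


def original(a, b, c):
    return (a and b) or (a and c)


def restructured(a, b, c):      # factored form of the same function
    return a and (b or c)


def bad_candidate(a, b, c):     # differs from original only when a=0, b=1, c=1
    return (a and (b or c)) or (b and c and not a)


# A small, fixed set of simulation vectors that happens to miss (0, 1, 1).
vectors = [(0, 0, 0), (1, 1, 0), (1, 0, 1), (0, 1, 0)]
ref_sig = signature(original, vectors)

results = {}
for cand in (restructured, bad_candidate):
    if signature(cand, vectors) != ref_sig:
        results[cand.__name__] = "rejected by simulation"
    elif formally_equivalent(original, cand, 3):
        results[cand.__name__] = "verified"
    else:
        results[cand.__name__] = "refuted by formal check"
print(results)
```

In this toy setting both candidates pass the signature filter, and only the formal check
separates them. The percentage of equivalence checks reported in Table 9.1 measures how
often a signature-matched candidate turns out to be truly equivalent.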
To check if our techniques provide comparable improvement when the initial placement
is optimized for timing, we performed the following experiment. We first produced
64 independent initial placements optimized for total wirelength. Compared to these 64
wirelength-optimized placements, the best placements achieve 17.0% shorter delay on
average and serve as proxies for timing-optimized placements in our experiments. Starting
with these initial placements already optimized for delay, our logic restructuring approach
can extract further improvements, reducing the delay by 6.5% on average.
9.5.3 Comparison with Redundancy Addition and Removal
We compare our technique with timing optimization using redundancy addition and
removal (RAR). We implement redundancy removal using signatures to identify equivalent
nodes up to don’t-cares. In the context of path-based resynthesis, the inputs to the
subcircuit, along with signals that have earlier arrival time and are within a bounding box
determined by the HPWL of the output, are considered as candidates for rewiring. If one
of these signals is equivalent to the output up to don’t-cares in the circuit, rewiring is
















Table 9.2: Effectiveness of our approach compared to RAR.
In Table 9.2, we compare the delay improvement of our resynthesis strategy to redundancy
addition and removal. For this experiment, we report results on a random slice of
initial placements from our suite. Note that our technique achieves almost twice the
delay improvement of RAR, and our results are more consistent over all the circuits and
are never worse than RAR.
Figure 9.10: The graph illustrates that the largest actual delay improvements occur
at portions of the critical path with the largest estimated gain using our metric.
The data points are accumulated gains achieved by 400 different resynthesis
attempts when optimizing the circuits in Table 9.1.
In Figure 9.10, we demonstrate that our delay-based NMF metric is effective at guiding
optimization. Each data point represents a different resynthesis attempt considering all of
the circuits in Table 9.1. The x-axis shows the predicted percentage delay gain possible
(determined by the optimally buffered delay). The y-axis indicates the actual gain. Data
points that lie on the x-axis indicate resynthesis attempts that did not improve delay (a
better topology could not be found). The 50% threshold line divides the graph so that
the number of resynthesis attempts is equal on both sides. The diagonal line indicates
an upper-bound prediction for delay gain. Because some of the optimizations reduce the
support of the original subcircuit, we can improve the delay beyond the original estimate,
which considers all of the subcircuit’s inputs. Therefore, some of the data points are
above the upper-bound line. On the other hand, a resynthesis attempt produces a smaller
than estimated improvement when the ideal topology is not logically feasible or when
removing cell overlap degrades the quality of the initial placement. Although the NMF
and gain calculations do not directly incorporate circuit functionality, 74% of all delay
gains are found on the right half of the graph. The correlation of our metric could be
further improved by incorporating the percentage of gain possible with respect to
near-critical paths.
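The quantities read off Figure 9.10 can be computed from a list of (predicted gain,
actual gain) pairs, one per resynthesis attempt. The sketch below uses made-up data
points, not our measurements. It assumes that the 50% threshold is the median predicted
gain and that the share on the right half is the fraction of the total actual gain
contributed by attempts to the right of that threshold.

```python
from statistics import median


def summarize_attempts(points):
    """points: list of (predicted_gain_pct, actual_gain_pct), one per attempt."""
    threshold = median(p for p, _ in points)            # splits the attempts evenly
    total = sum(a for _, a in points)
    right = sum(a for p, a in points if p > threshold)  # gain right of the threshold
    above_bound = sum(1 for p, a in points if a > p)    # attempts that beat the estimate
    no_gain = sum(1 for _, a in points if a == 0)       # attempts lying on the x-axis
    share_right = right / total if total else 0.0
    return threshold, share_right, above_bound, no_gain


# Hypothetical attempts (predicted %, actual %); illustrative only.
attempts = [(2, 0), (4, 1), (5, 0), (6, 2), (10, 4), (12, 14), (15, 9), (20, 10)]
threshold, share_right, above_bound, no_gain = summarize_attempts(attempts)
print(f"threshold = {threshold}%, share of gain on right half = {share_right:.1%}")
print(f"attempts above the upper bound = {above_bound}, attempts with no gain = {no_gain}")
```

A share well above 50% on the right half indicates that the estimate ranks resynthesis
opportunities usefully even when individual predictions are off.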
9.6 Concluding Remarks
In this chapter, we leveraged our simulation-based framework to improve the quality
of delay optimization without sacrificing other performance metrics. In particular, we
introduced a novel simulation-guided synthesis strategy that is more comprehensive than
current restructuring techniques. We developed a path-monotonicity metric to focus our
efforts on the most important regions of a design. Our optimizations lead to 11.7% delay
improvement on average, over several different initial placements. Also, our delay-based
monotonicity metric indicated that 65% of the paths analyzed were non-monotone. We
further observe delay improvements on placements initially optimized for delay, which are
consistent with our reported average improvement. We believe that our approach offers
an effective bridge between current topological-based synthesis and lower-level physical
synthesis approaches. It enables less conservative timing estimates to be made early in the
design flow so that other performance metrics can be improved without adversely affecting




Achieving timing closure is becoming increasingly difficult due to the increasing
significance of interconnect delay. For complex designs, failing to achieve timing closure
results in costly design-flow iterations and delays market entry of the final product.
Previous strategies for achieving timing closure are often incapable of exploiting logic
transformations that promise significant delay improvements. In this dissertation, we
introduce an aggressive physical synthesis application that employs a broader set of
optimizations to reduce interconnect delay while minimizing impact on the remaining
circuit. The goal is to improve timing closure, as interconnect becomes more dominant and
current methodologies become less adequate. To enable powerful transformations, we
leverage logic simulation to characterize the behavior and flexibility of internal nodes
using bit signatures. By performing logic manipulations on the signatures instead of the
circuit, we abstract away much of the design complexity, enabling numerous transformations
to be examined. Transformations that result in the greatest delay improvements are
verified formally, while the scalability of such verification is handled with particular
care.
10.1 Summary of Contributions
In this dissertation, we developed a comprehensive signature-based framework that
efficiently identifies logic optimizations in complex digital designs. This framework
consists of the following components.
• A simulation strategy for sensitizing parts of a design that are difficult to control
from the primary inputs with the goal of exposing corner-case behavior in the design.
We developed a novel metric for determining the information content (entropy)
of different groups of signals under given simulation vectors and introduced a
SAT-based algorithm for evenly sensitizing these signals. In the experimental evaluation,
our techniques evenly sensitized a design where random simulation had not succeeded.
• A strategy for efficiently computing don’t-cares and encoding them in signatures to
enhance synthesis optimizations. We showed that the approximation used to gener-
ate don’t-cares was both fast and accurate.
• A SAT-solving methodology that leverages the increasing availability of multi-core
systems to enable more efficient verification of signature-based transformations. We
introduced a priority scheduler for handling multiple SAT instances of varying
complexity and proposed a lightweight parallelization strategy to solve particularly hard
instances.
• Techniques for logic manipulation based on signatures. We introduced a node-merging
optimization that leads to significant area reductions. We also developed a
signature-based resynthesis strategy that can be efficiently guided by physical
optimization criteria, such as delay and wirelength minimization, as well as routability.
We then introduced a post-placement resynthesis strategy that uses path
monotonicity to identify paths that are most amenable to performance optimization.
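As an illustration of the first component, the information content of a group of signals
under a set of simulation vectors can be estimated with Shannon’s entropy over the value
combinations that the group takes. The sketch below is a generic textbook calculation with
hypothetical signatures. It is not the specific metric or the SAT-based sensitization
algorithm developed in this dissertation.

```python
from collections import Counter
from math import log2


def group_entropy(signatures, num_vectors):
    """Shannon entropy (in bits) of the joint values taken by a group of signals.

    signatures: list of ints, one per signal; bit i is the value under vector i.
    """
    patterns = Counter(
        tuple((sig >> i) & 1 for sig in signatures) for i in range(num_vectors)
    )
    return -sum((n / num_vectors) * log2(n / num_vectors) for n in patterns.values())


# Two signals, four vectors: every value combination appears exactly once.
even = group_entropy([0b0101, 0b0011], 4)
# Same number of vectors, but the two signals always carry the same value.
skewed = group_entropy([0b0101, 0b0101], 4)
print(even, skewed)
```

For two signals the maximum is 2 bits, reached when all four value combinations occur
equally often. A lower value indicates that the vectors leave some combinations
unexercised, which is the situation that even sensitization aims to correct.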
Empirical results indicated the effectiveness of our signature-based methodology. We
showed that logic simulation can efficiently target hard-to-sensitize regions in a circuit.
Furthermore, we demonstrated that signatures are a good approximation of a node’s
functionality, and can account for both controllability and satisfiability don’t-cares.
For example, signatures were used to effectively identify nodes that could be merged, and
don’t-cares facilitated additional node mergers. These results indicated the potential of
functional simulation to support fast and powerful design optimizations. In the physical
synthesis domain, we demonstrated that the ability to quickly identify numerous resynthesis
opportunities is particularly advantageous. Empirical results confirmed that our techniques
compare favorably with earlier algorithms.
10.2 Directions for Future Research
The use of logic simulation as an abstraction represents a major contribution of our
work. This abstraction simplifies the search for powerful optimizations subject to
functional, temporal, and physical constraints. Our techniques support a variety of
performance, power, and manufacturability objectives.
We believe that our don’t-care analysis is also useful to enhance verification coverage.
In practice, a simulation vector that toggles logic containing a design bug might not
produce an observable discrepancy with the golden model at the outputs. By incorporating
observability measures in our coverage analysis to guide a SAT-based resimulation, we
could improve the quality of simulation performed.
Finally, we observe that our work in parallel SAT enables the development of a new
methodology in CAD tool flows that better utilizes multi-core systems. A future avenue
of research would consider multiple incremental optimizations applied in parallel. Such









[1] M. Abramovici, M. Breuer, and A. Friedman, “Digital system testing and testable
design”,W.H.Freeman, 1990.
[2] M. Abramovici, J. DeSousa, and D. Saab, “A massively-parallel easily-scalable sat-
isfiability solver using reconfigurable hardware”,DAC, pp. 684-690, 1999.
[3] A. Ajami and M. Pedram, “Post-layout timing-driven cellp acement using an accu-
rate net length model with movable steiner points”,DAC, pp. 595-600, 2001.
[4] F. Aloul, B. Sierawski, and K. Sakallah, “Satometer: howmuch have we searched?”,
TCAD, pp. 995-1004, 2003.
[5] C. Alpert, A. Kahng, C. Sze, and Q. Wang, “Timing-driven steiner trees are (practi-
cally) free”,DAC, pp. 389-392, 2006.
[6] C. Alpert, A. Devgan, and C. Kashyap, “RC delay metric forperformance optimiza-
tion”, TCAD, pp. 571-582, 2001.
[7] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patter-
son, W. Plishker, J. Shalf, S. Williams, and K. Yelick, “The landscape of parallel
computing research: a view from Berkeley”,ERL TR, Berkeley, 2006.
[8] R. Ashenhurst, “The decomposition of switching functions”, International Sympo-
sium on the Theory of Switching, pp. 74-116, 1957.
[9] L. Baptista and J. Marques-Silva, “Using randomizationand learning to solve hard
real-world instances of satisfiability”,ICPPCP, pp. 489-494, 2000.
[10] G. Beraudo and J. Lillis, “Timing optimization of FPGA placements by logic repli-
cation”,DAC, pp. 541-548, 2003.
[11] V. Bertacco and M. Damiani, “The disjunctive decompositi n of logic functions”,
ICCAD, pp. 78-82, 1997.
[12] A. Biere, A. Cimatti, E. Clarke, and Y. Zhu, “Symbolic model checking without
BDDs”, TACAS, pp. 193-207, 1999.
165
[13] D. Brand, “Verification of large synthesized designs”,ICCAD, pp. 534-537, 1993.
[14] R. Bryant, “Graph-based algorithms for Boolean function manipulation”,Trans. on
Comp., pp. 677-691, 1986.
[15] M. Bushnell and V. Agrawal, “Essentials of electronic testing”,Kluwer, pp. 129-150,
2000.
[16] A. Caldwell, A. Kahng, and I. Markov, “Can recursive bisection alone produce
routable placements?”,DAC, pp. 693-698, 2000.
[17] C.-W Chang, C.-K Cheng, P. Suaris, and M. Marek-Sadowska, “Fast post-placement
rewiring using easily detectable functional symmetries”,DAC, pp. 286-289, 2000.
[18] K. H. Chang, I. L. Markov, and V. Bertacco, “Safe delay optimization for physical
synthesis”,ASP-DAC, pp. 628-633, 2007.
[19] K.-H. Chang, I. Markov, and V. Bertacco, “Fixing designerrors with counterexam-
ples and resynthesis”,ASP-DAC, pp. 944-949, 2007.
[20] K.-H Chang, D. Papa, I. Markov, and V. Bertacco, “InVerS: an incremental veri-
fication system with circuit similarity metrics and error visualization”,ISQED, pp.
487-494, 2007.
[21] S. Chatterjee and R. Brayton, “A new incremental placement algorithm and its appli-
cation to congestion-aware divisor extraction”,ICCAD, pp. 541-548, 2004.
[22] W. Chrabakh and R. Wolski, “GraDSAT: a parallel SAT solver for the grid”,UCSB
Comp. Sci. TR, 2003.
[23] C. Chu and Y.-C. Wong, “Fast and Accurate Rectilinear Steiner Minimal Tree Algo-
rithm for VLSI Design”,ISPD, pp. 28-35, 2005.
http://class.ee.iastate.edu/cnchu/flute.html
[24] M. Davis, G. Logemann, and D. Loveland, “A machine program for theorem prov-
ing”, Comm. of ACM, pp. 394-397, 1962.
[25] R. Dechter, K. Kask, E. Bin, and R. Emek, “Generating random solutions for con-
straint satisfaction problems”,AAAI, pp. 15-21, 2002.
[26] G. DeMicheli and M. Damiani, “Synthesis and optimization of digital circuits”,
McGraw-Hill, 1994.
[27] N. Dershowitz, Z. Hanna, and A. Nadel, “Towards a betterunderstanding of the
functionality of conflict-driven SAT solver”,SAT, pp. 287-293, 2007.
166
[28] N. Een and A. Biere, “Effective Preprocessing in SAT through Variable and Clause
Elimination”,SAT, pp. 61-75, 2005.
[29] N. Een and N. Sorensson, “An extensible SAT-solver”,SAT, pp. 502-518, 2003.
http://www.cs.chalmers.se/∼een/Satzoo
[30] W. C. Elmore, “The transient response of damped linear network with particular
regard to wideband amplifiers”,J. Appl. Phys, pp. 55-63, 1948.
[31] C. Fiducia and R. Mattheyses, “A linear-time heuristicfor improving network parti-
tions”, DAC, pp. 175-181, 1982.
[32] Z. Fu, Y. Yu, and S. Malik, “Considering circuit observability don’t cares in cnf
satisfiability”,DATE, pp. 1108-1113, 2005.
[33] E. Gamma, R. Helm, R. Johnson, and J. Vlissides,Design Patterns: Elements of
Reusable Object-Oriented Software, Addison-Wesley, 1995.
[34] E. Goldberg, M. Prasad, and R. Brayton, “Using SAT for combinational equivalence
checking”,DATE, pp. 114-121, 2001.
[35] C. Gomes, W. Hoeve, A. Sabharwal, and B. Selman, “Counting CSP solutions using
generalized xor constraints”,AAAI, pp. 204-209, 2007.
[36] C. Gomes, A. Sabharwal, and B. Selman, “Near-uniform sapling of combinatorial
spaces using xor constraints”,NIPS, pp. 481-488, 2006.
[37] C. Gomes, A. Sabharwal, and B. Selman, “Model counting:a new strategy for ob-
taining good bounds”,AAAI, pp. 54-61, 2006.
[38] C. Gomes and B. Selman, “Algorithm portfolios”,AI, pp. 43-62, 2001.
[39] C. Gomes, B. Selman, K. McAloon, and C. Tretkoff, “Randomization in backtrack
search: exploiting heavy-tailed profiles for solving hard scheduling problems”,AIPS,
pp. 208-213, 1998.
[40] W. Gosti, A. Narayan, R. Brayton, and A. Sangiovanni-Vincentelli, “Wireplanning
in logic synthesis”,ICCAD, pp. 26-33, 1998.
[41] W. Gosti, S. Khatri, and A. Sangiovanni-Vincentelli, “Addressing the timing closure
problem by integrating logic optimization and placement”,ICCAD, pp. 224-231,
2001.
[42] M. Hrkic, J. Lillis, and G. Beraudo, “An approach to placement-coupled logic repli-
cation”,DAC, pp. 711-716, 2004.
167
[43] M. Hrkic and J. Lillis, “S-Tree: a technique for buffered routing tree synthesis”,
DAC, pp. 578-583, 2002.
[44] W. Jordan, “Towards efficient sampling: exploiting random walk strategies”,AAAI,
pp. 670-676, 2004.
[45] L. Kannan, P. Suaris, and H. Fang, “A methodology and algorithms for post-
placement delay optimization”,DAC, pp. 327-332, 1994.
[46] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, “Multilevel hypergraph parti-
tioning: applications in VLSI domain”,TVLSI, pp. 69-79, 1999.
[47] A. Kuehlmann, V. Paruthi, F. Krohm, and M. Ganai, “Robust Boolean reasoning for
equivalence checking and functional property verification”, TCAD, pp. 1377-1394,
2002.
[48] V. Kravets and K. Sakallah, “Resynthesis of multi-level circuits under tight con-
straints using symbolic optimization”,ICCAD, pp. 687-693, 2002.
[49] S. Krishnaswamy, S. Plaza, I. Markov, and J. Hayes, “Reliability-aware Synthesis
using Logic Simulation”,ICCAD, pp. 149-154, 2007.
[50] F. Krohm, A. Kuehlmann, and A. Mets, “The use of random siulation in formal
verification”, ICCD, pp. 371-376, 1996.
[51] H. Hoos and T. Stutzl, “SATLIB: an online resource for research on SAT”,SAT, pp.
283-292, 2000.
[52] Y. Hu, V. Shih, R. Majumdar, and L. He, “Exploiting symmetry in SAT-based
Boolean matching for heterogeneous FPGA technology mapping”, ICCAD, pp. 350-
353, 2007.
[53] Y.-Min. Jiang, A Krstic, K.-Ting Cheng, and M. Marek-Sadowska, “Post-layout
Logic Restructuring For Performance Optimization”,DAC, pp. 662-665, 1997.
[54] H. Lee and D. Ha, “On the generation of test patterns for combinational circuits”,TR
No. 12-93, Dept. of Electrical Eng., Virginia Polytechnic Inst.
[55] M. Lewis, T. Schubert, and B. Becker, “Multithreaded SAT solving”, ASP-DAC, pp.
926-932, 2007.
[56] C. Li, C-K. Koh, and P. H. Madden, “Floorplan management: i cremental placement
for gate sizing and buffer insertion”,ASP-DAC, pp. 349-354, 2005.
[57] F. Lu, L.-C. Wang, K.-T. Cheng, J. Moondanos, and Z. Hanna, “A signal correlation
guided circuit-SAT solver”,J. UCS 10(12), pp. 1629-1654, 2004.
168
[58] P. Manolios and Y. Zhang, “Implementing survey propogation on graphics processing
units”, SAT, pp. 311-324, 2006.
[59] A. Mishchenko, S. Chatterjee, R. Jiang, and R. Brayton,“FRAIGs: a unify-
ing representation for logic synthesis and verification”,ERL TR, Berkeley, 2005.
http://www.eecs.berkeley.edu/∼alanmi/publications/
[60] A. Mishchenko, J. Zhang, S. Sinha, J. Burch, R. Brayton,a d M. Chrzanowska-
Jeske, “Using simulation and satisfiability to compute flexibilit es in Boolean net-
works”, TCAD, pp. 743-755, 2006.
[61] A. Mishchenko and R. Brayton, “SAT-based complete don’t care computation for
network optimization”,DATE, pp. 412-417, 2005.
[62] A. Mishchenko, S. Chatterjee, and R. Brayton, “DAG-aware AIG rewriting: a fresh
look at combinational logic synthesis”,DAC, pp. 532-536, 2006.
[63] A. Mishchenko, S. Chatterjee, R. Brayton, and N. Een, “Improvements to combina-
tional equivalence checking”,ICCAD, pp. 532-536, 2006.
[64] G. Moore, “Cramming more components onto integrated chips”, Electronics Maga-
zine, Vol. 38, No. 8, 1965.
[65] M. Moskewicz, C. Madigan, Y. Zhao, L. Zhang, and S. Malik, “Chaff: engineering
an efficient SAT solver”,DAC, pp. 530-535, 2001.
[66] F. Okushi, “Parallel cooperative propositional theorm proving”,Annals of Mathe-
matics and AI, pp. 59-85, 1999.
[67] R. Otten and R. Brayton, “Planning for performance”,DAC, pp. 122-127, 1998.
[68] M. Pedram and N. Bhat, “Layout driven logic restructuring/decomposition”,ICCAD,
pp. 134-137, 1991.
[69] S. Plaza, K.-H Chang, I. Markov, and V. Bertacco, “Node mrgers in the presence of
don’t cares”,ASP-DAC, pp. 414-419, 2006.
[70] S. Plaza and V. Bertacco, “STACCATO: disjoint support decompositions from BDDs
through symbolic kernels”,ASP-DAC, pp. 276-279, 2005.
[71] N. Saluja and S. Khatri, “A robust algorithm for approximate compatible observabil-
ity don’t care computation”,DAC, pp. 422-427, 2004.
[72] H. Savoj and R. Brayton, “The use of observability and external don’t-cares for the
simplification of multi-level networks”,DAC, pp. 297-301, 1990.
169
[73] P. Saxena, N. Menezes, P. Cocchini, and D. Kirkpatrick,“Repeater scaling and its
impact on CAD”,TCAD, pp. 451-463, 2004.
[74] E. Sentovich, K. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P.
Stephan, R. Brayton, and A. Sangiovanni-Vincentelli, “SIS: a system for sequential
circuit synthesis”,ERL TR, Berkeley, 1992.
[75] C. Shannon, “A Mathematical Theory of Communication”,Bell System Technical
Journal, Vol. 27, pp. 379-423, 623-656, 1948.
[76] M. Sipser, “Introduction to the theory of computation,second edition”,Course Tech-
nology, 2005.
[77] S. Sinha, A. Mishchenko, and R. Brayton, “Topologically constrained logic synthe-
sis”, ICCAD, pp. 679-686, 2002.
[78] G. Stenz, B. Riess, B. Rohfleisch, and F. Johannes, “Timing driven placement in
interaction with netlist transformations”,ISPD, pp. 36-41, 1997.
[79] S. Shyam and V. Bertacco, “Distance-guided hybrid verification with GUIDO”,
DATE, pp. 1211-1216, 2006.
[80] J. Marques-Silva and K. Sakallah, “GRASP: A search algorithm for propositional
satisfiability”, IEEE Trans. Comp, pp. 506-521, 1999.
[81] L. Valiant and V. Vazirani, “NP is as easy as detecting uniq e solutions”,Theor.
Comput. Sci., pp. 85-93, 1986.
[82] L.P.P.P van Ginneken, “Buffer placement in distributed RC-tree networks for mini-
mal Elmore delay”,ISCAS, pp. 865-868, 1990.
[83] I. Wagner, V. Bertacco, T. Austin, “StressTest: an automatic approach to test genera-
tion via activity monitors”,DAC, pp. 783-788, 2005.
[84] J. Werber, D. Rautenbach, and C. Szegedy, “Timing optimization by restructuring
long combinatorial paths”,ICCAD, pp. 536-543, 2007.
[85] R. Williams, C. Gomes, and B. Selman, “Backdoors to typical case complexity”,
IJCAI, 2003.
[86] J. Yuan, K. Albin, A. Aziz, and Carl Pixley, “Simplifying Boolean constraint solving
for random simulation-vector generation”,IEEE TCAD, pp. 412-420, 2004.
[87] L. Xu, F. Hutter, H. Hoos, and K. Leyton-Brown, “SATzilla-07: the design and anal-
ysis of an algorithm portfolio for SAT”,CP, pp. 712-727, 2007.
[88] H. Zhang, “SATO: an efficient propositional prover”,CADE, pp. 272-275, 1997.
170
[89] H. Zhang, M.P. Bonacina, and J. Hsiang, “PSATO: a distributed propositional prover
and its application to quasigroup problems”,J. of Symb. Comp., pp. 1-18, 1996.
[90] L. Zhang, C. Madigan, M. Moskewicz, and S. Malik, “Efficient conflict driven learn-
ing in Boolean satisfiability”,ICCAD, pp. 279-285, 2001.
[91] Y. Zhao, M. Moskewicz, C. Madigan, and S. Malik, “Accelerating Boolean satisfia-
bility through application specific processing”,ISSS, pp. 244-249, 2001.
[92] P. Zhong, M. Martonosi, P. Ashar, and S. Malik, “Using configurable computing to
accelerate Boolean satisfiability”,TCAD, pp. 861-868, 1999.
[93] S. Yamashita, H. Sawada, and A. Nagoya, “ SPFD: a new method to express func-
tional flexibility”, TCAD, pp. 840-849, 2000.
[94] Y.-S. Yang, S. Sinha, A. Veneris, and R. Brayton, “Automating logic rectification by
approximate SPFDs”,ASP-DAC, pp. 402-407, 2007.
[95] Q. Zhu, N. Kitchen, A. Kuehlmann, and A. Sangiovanni-Vincentelli, “SAT sweeping
with local observability don’t cares”,DAC, pp. 229-234, 2006.
[96] “Constrained-random test generation and functional coverage with Vera”,TR, Syn-
opsys, Inc, Feb, 2003.
[97] Specman elite — testbench automation, 2004.
http://www.verisity.com/products/specman.html
[98] Berkeley Logic Synthesis and Verification Group, “ABC:
a system for sequential synthesis and verification”.
http://www.eecs.berkeley.edu/∼alanmi/abc/
[99] AMD, “High performance AMD Phenom X4 processors lead the c arge to HD desk-
top gaming and video”, 2008.
[100] Intel, “FDIV replacement program: statistical analysis of floating point flaw”,White
Paper, 1994
[101] The International Technology Roadmap for Semiconductors, 2005 Edition,ITRS.
[102] IWLS OpenCore Benchmarks.
http://iwls.org/iwls2005/benchmarks.html
[103] Cadence Encounter RTL Compiler.http://www.cadence.com
[104] Synopsys DesignCompiler.http://www.synopsys.com
[105] UMICH Physical Design Tools.
http://vlsicad.eecs.umich.edu/BK/PDtools/
171
