Equivalence Checking for High-Assurance Behavioral Synthesis by Hao, Kecheng
Portland State University
PDXScholar
Dissertations and Theses
Spring 6-10-2013
Part of the Hardware Systems Commons, and the Other Computer Sciences Commons
This Dissertation is brought to you for free and open access. It has been accepted for inclusion in Dissertations and Theses by an authorized
administrator of PDXScholar. For more information, please contact pdxscholar@pdx.edu.
Recommended Citation
Hao, Kecheng, "Equivalence Checking for High-Assurance Behavioral Synthesis" (2013). Dissertations and Theses. Paper 1066.
10.15760/etd.1066
Equivalence Checking for High-Assurance Behavioral Synthesis
by
Kecheng Hao
A dissertation submitted in partial fulfillment of the
requirements for the degree of
Doctor of Philosophy
in
Computer Science
Dissertation Committee:
Fei Xie, Chair
Feng Liu
Sandip Ray
Bryant York
Fu Li
Portland State University
2013
ABSTRACT
The rapidly increasing complexities of hardware designs are forcing design
methodologies and tools to move to the Electronic System Level (ESL), a higher
abstraction level with better productivity than the state-of-the-art Register
Transfer Level (RTL). Behavioral synthesis, which automatically synthesizes ESL
behavioral specifications to RTL implementations, plays a central role in this
transition. However, since behavioral synthesis is a complex and error-prone
translation process, the lack of designers' confidence in its correctness becomes
a major barrier to its wide adoption. Therefore, techniques for establishing
equivalence between an ESL specification and its synthesized RTL implementation
are critical to bring behavioral synthesis into practice.
The major research challenge to equivalence checking for behavioral synthesis is
the significant semantic gap between ESL and RTL. The semantics of ESL involve
untimed, sequential execution; however, the semantics of RTL involve timed,
concurrent execution. We propose a sequential equivalence checking (SEC) framework
for certifying a behavioral synthesis flow, which exploits information on successive
intermediate design representations produced by the synthesis flow to bridge the
semantic gap. In particular, the intermediate design representation after scheduling
and pipelining transformations permits effective correspondence of internal
operations between this design representation and the synthesized RTL implementation,
enabling scalable, compositional equivalence checking. Certifications of loop and
function pipelining transformations are possible by a combination of theorem
proving and SEC through exploiting pipeline generation information from the synthesis
flow (e.g., the iteration interval of a generated pipeline). The complexity brought
by bubbles in function pipelines is creatively reduced by symbolically encoding all
possible bubble insertions in one pipelined design representation. The result of
this dissertation is a robust, practical, and scalable framework for certifying RTL
designs synthesized from ESL specifications. We have validated the robustness,
practicality, and scalability of our approach on industrial-scale ESL designs that
result in tens of thousands of lines of RTL implementations.
DEDICATION
To my parents, Fengming and Fanying
To my wife, Kai
To my daughter, Sophia
ACKNOWLEDGMENTS
It has been a long journey to finish this dissertation research. During my research,
I received help from many people, including my teachers, collaborators, friends, and
family, and I would like to take this opportunity to thank all of them.
First and foremost, I wish to express my thanks and great appreciation to my
advisor, Prof. Fei Xie, for his patient guidance, enthusiastic encouragement, and
useful critiques. He is a wonderful advisor. He guided me to learn how to identify
critical problems and resolve them, which has been invaluable to my research. He is
also an amazing researcher. His rare combination of strengths in both the practical
and the theoretical has been a continuous inspiration to me. Without his help, I
do not believe I could have accomplished this dissertation.
I would also like to thank Dr. Sandip Ray for the enormous amount of time
and effort that he spent on my research. All my papers related to this dissertation
were completed in collaboration with Sandip. He is a great collaborator and an
outstanding researcher. He has always helped me with my research through email and
phone, even on weekends. My thanks also go to Dr. Jin Yang and Dr. Naren Narasimhan
for generously providing feedback and discussing future work from an industrial
point of view.
I am grateful to my other committee members, Prof. Fu Li, Prof. Feng Liu,
and Prof. Bryant W. York, for inspiration in many ways and valuable feedback on
my research proposal and dissertation. I also thank Prof. Xiaoyu Song, who initially
brought me on board to electronic design automation.
The dissertation also benefited from various discussions with current and
former lab members: Yan Chen, Ping Hang Cheung, Nicholas T. Pilkington, Juncao Li,
Sharookh Daruwalla, Zhenkun Yang, and Disha Gandhi. I would also like
to thank my colleagues at Xilinx for exchanging knowledge of high-level synthesis,
including Dr. Yiping Fan, Dr. Zhiru Zhang, Dr. Peichen Pan, and Dr. Guoling
Han.
Finally, I must acknowledge my family for supporting me all the time. Thank
you to my parents for always being supportive. Special thanks to my lovely
wife, Kai, for her love, patience, encouragement, and immense personal sacrifice.
TABLE OF CONTENTS
Abstract
Dedication
Acknowledgments
List of Tables
List of Figures
Chapter 1 INTRODUCTION
  1.1 Motivation and Problem Statement
    1.1.1 Motivation
    1.1.2 Problem Statement
  1.2 Contribution
  1.3 Related Work
  1.4 Dissertation Outline
Chapter 2 BACKGROUND
  2.1 Behavioral Synthesis
  2.2 Solver Technology
    2.2.1 Binary Decision Diagram
    2.2.2 Boolean Satisfiability Problem
    2.2.3 Satisfiability Modulo Theories
  2.3 Scalable Verification Techniques
    2.3.1 Symbolic Simulation
    2.3.2 Equivalence Checking for Logic Synthesis
Chapter 3 EQUIVALENCE CHECKING
  3.1 Clocked Control/Data Flow Graphs
  3.2 Circuit Model
  3.3 Correspondence between CCDFGs and Circuits
  3.4 Dual-Rail Simulation for Equivalence Checking
  3.5 Tool Implementation
  3.6 Experimental Results
Chapter 4 OPTIMIZATIONS
  4.1 Motivation and Overview
  4.2 Cut-points
  4.3 Cut-loop Optimization
  4.4 Modular Analysis
  4.5 Experimental Results
Chapter 5 SEC FOR SYNTHESIZED LOOP PIPELINES
  5.1 Motivation and Overview
  5.2 Challenges with Loop Pipelines
  5.3 SEC with Reference Model
  5.4 Experimental Results
Chapter 6 SEC FOR SYNTHESIZED FUNCTION PIPELINES
  6.1 Motivation and Overview
  6.2 Challenges with Function Pipelining
  6.3 SEC for Function Pipelining
    6.3.1 Algorithm to build Reference Model
    6.3.2 SEC between CCDFGs and the RTL
  6.4 Experimental Results
Chapter 7 CONCLUSION AND FUTURE WORK
  7.1 Summary of Contributions
  7.2 Future Research Directions
    7.2.1 Hierarchical Function Pipelines
    7.2.2 Verification of Behaviorally Synthesized Interfaces
    7.2.3 SEC for Compiler Transformations in Behavioral Synthesis
References
LIST OF TABLES
3.1 Bit-level equivalence checking statistics
3.2 Word-level equivalence checking statistics
4.1 Designs, features, and optimizations
4.2 Word-level equivalence checking statistics
5.1 Loop pipelining experimental results
6.1 Function pipelining experimental results
LIST OF FIGURES
2.1 Input and output of behavioral synthesis
2.2 Decision tree representation
2.3 BDD transformations
2.4 Simple cut-point example
3.1 CCDFGs for the TEA encryption function
3.2 Operation mapping between CCDFG and circuit
3.3 Dual-rail simulation scheme for equivalence checking between CCDFG and circuit
3.4 Framework of equivalence checker
4.1 C source code and CCDFG for GCD
4.2 Cut-loop optimization for GCD example
4.3 Modular SEC for 3DES
5.1 Example of loop pipeline
5.2 Input and output CCDFGs of loop pipelining transformation
5.3 Construction of scheduling steps
5.4 Construction of edges
5.5 Pipeline registers and forwarding
6.1 Example of function pipeline
6.2 Difference between un-pipelined version and pipelined version
6.3 Hardware interface
6.4 Pipelined CCDFGs for different bubble insertion scenarios
6.5 Input and output CCDFGs of function pipelining transformation
6.6 Generate pipeline registers
6.7 Construction of scheduling steps and edges
6.8 Insert guard variables and assignment
6.9 Final pipelined CCDFG
6.10 Waveform of pipeline forwarding
Chapter 1
INTRODUCTION
1.1 MOTIVATION AND PROBLEM STATEMENT
1.1.1 Motivation
Recent years have seen increasingly higher complexities in hardware designs,
resulting from advances in VLSI technology as well as growing demands on
performance and power imposed by modern applications. Such complexities, in addition
to stringent time-to-market requirements, make it challenging to develop reliable,
high-quality systems through hand-crafted Register Transfer Level (RTL)
implementations. This underscores the need for modeling, synthesis, and validation of
hardware at higher levels of abstraction and has motivated a gradual migration
away from RTL towards the Electronic System Level (ESL), which allows design
functionality to be described abstractly in high-level languages such as SystemC or
C/C++. However, the practicality of ESL designs crucially depends on reliable tools
for behavioral synthesis, that is, automated synthesis of a hardware circuit from its
ESL description. Behavioral synthesis tools apply a sequence of transformations
to compile the behavioral description to an RTL implementation.
Several behavioral synthesis tools are available today, e.g., AutoESL [72],
CatapultC [54], C-to-Silicon [11], Cynthesizer [23], Spark [25], and LegUp [13].
Nevertheless, and despite a great need, behavioral synthesis has not yet found wide
acceptance in current industrial practice. A major barrier to its adoption is the
lack of designers' confidence in the correctness of the synthesis tools themselves.
The large semantic gap between a synthesized implementation and its behavioral
description makes it hard to ensure that the synthesized implementation indeed
conforms to its behavioral description. Moreover, many of the behavioral synthesis
transformations employed include complex optimizations to satisfy the growing
demands of performance and power. Consequently, current synthesis tools are often
either (a) error-prone or (b) overly conservative, thus often producing circuits of
poor quality and performance. Therefore, developing tools and technologies to ensure
the correctness of behavioral synthesis tools is critical to bringing behavioral
synthesis into practice. This is the main motivation of this dissertation research.
1.1.2 Problem Statement
To ensure correctness of behavioral synthesis, we employ a formal verification
technique called sequential equivalence checking (SEC). We need to address the
following three key research challenges brought about by the semantic gap between
ESL and RTL descriptions of the same hardware:
• How can we build a practical sequential equivalence checking framework? The
  significant semantic gap makes direct equivalence checking between ESL
  and RTL impractical. The challenge is to effectively close the semantic gap
  and build a practical checking framework.
• How can we scale to industry designs? The scale of real-world industrial
  designs synthesized by a state-of-the-art behavioral synthesis tool is typically
  tens of thousands of lines of RTL. In the worst case, the running time of
  equivalence checking is exponential in the size of a design. Due to the semantic
  gap, the existing optimizations for traditional hardware verification cannot be
  directly applied. To make our approach scalable, we need to develop optimizations
  that specifically target equivalence checking for behavioral synthesis.
• How can we verify the correctness of overlapping executions introduced by
  pipelines? Loop pipelining and function pipelining are key transformations
  to improve the quality of results (QoR) of RTL implementations generated
  by behavioral synthesis. Loop pipelining reduces the latency of synthesized
  designs by concurrently executing operations in successive loop iterations.
  Function pipelining improves the throughput by allowing operations from
  successive invocations of a function to execute in parallel. However, they are
  complex transformations involving aggressive scheduling strategies to allow
  overlapping executions and careful control generation to eliminate hazards.
  This makes formal equivalence checking even more challenging.
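To make the latency benefit of overlapping loop iterations concrete, the usual textbook cycle-count estimate can be sketched in C. This is only an illustrative model under stated assumptions: the function names, the parameter `depth` (cycles per iteration), and the parameter `ii` (iteration interval, the number of cycles between the starts of consecutive iterations) are hypothetical and are not taken from any synthesis tool, and the estimate ignores stalls.

```c
#include <stdint.h>

/* Cycles for n iterations run back to back, each taking `depth` cycles
 * (illustrative model of an un-pipelined loop). */
uint64_t sequential_cycles(uint64_t n, uint64_t depth) {
    return n * depth;
}

/* Cycles when a new iteration starts every `ii` cycles: the first
 * iteration needs `depth` cycles and each later one adds `ii` more.
 * Assumes no stalls; `ii` is the iteration interval. */
uint64_t pipelined_cycles(uint64_t n, uint64_t depth, uint64_t ii) {
    if (n == 0) {
        return 0;
    }
    return depth + (n - 1) * ii;
}
```

Under this model, a 32-iteration loop with a four-cycle body takes 128 cycles when run sequentially and 35 cycles with an iteration interval of 1. When the iteration interval equals the body depth, the two estimates coincide.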
1.2 CONTRIBUTION
We developed a scalable SEC framework for certifying behavioral synthesis flows
that makes use of the intermediate design representation from the synthesis tool,
as well as other synthesis information, to achieve scalability. We defined a
graph-based representation of the ESL specification, namely the Clocked Control/Data
Flow Graph (CCDFG), as the intermediate design representation. In particular, this
dissertation makes the following contributions:
• A scalable SEC algorithm based on symbolic simulation for comparing CCDFG
  and RTL. Equivalence checking involves a word-level dual-rail symbolic
  simulation, which simulates two design representations simultaneously. In
  our approach, we simulate the CCDFG and the RTL implementation side by side,
  and their simulations are synchronized by clock cycle.
• A set of key optimizations, which exploit the close correspondence between
  CCDFGs and their synthesized RTL designs. Cutpoints reduce the lengths of
  symbolic expressions by replacing verified sub-circuits with new symbolic values.
  Cut-loop partitions SEC for a loop into three checks to avoid expensive
  fixed-point computation. Modular analysis optimizes SEC by replacing verified
  sub-modules with uninterpreted functions.
• An approach to certifying behaviorally synthesized loop pipelines. Our
  approach works by (1) constructing a provably correct loop pipeline reference
  model from the ESL specification, and (2) applying sequential equivalence
  checking between this reference model and the synthesized RTL. The key
  insight is that a parameterized, synthesis-guided reference transformation on
  the CCDFG permits comparison with RTL even after mappings with the original
  ESL specification have been destroyed by an aggressive transformation such
  as pipelining. Furthermore, the approach permits smooth integration with
  pipeline-oblivious optimizations such as cut-loop.
• An approach to certifying function pipelining in behavioral synthesis. We
  develop a reference function pipelining transformation, which takes certain
  pipeline parameters from behavioral synthesis to generate a pipeline reference
  model. The key is that bubble insertion is faithfully captured by symbolically
  encoding bubbles in the reference model. We check the equivalence
  between the reference model and the RTL implementation. The mapping
  between behavioral-level operations and RTL functional units is still
  preserved; therefore, some key optimizations, such as cutpoints, are applicable.
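The idea of advancing two design representations together, one clock cycle at a time, can be illustrated with a deliberately simplified sketch. It uses concrete input values rather than symbolic ones, so it is a testing analogue and not the symbolic dual-rail simulation of this dissertation. The two step functions and `lockstep_compare` are hypothetical names introduced only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference model (hypothetical): one accumulate step per clock cycle,
 * written with a multiplication. */
uint32_t spec_step(uint32_t acc, uint32_t x) {
    return acc + 3u * x;
}

/* Implementation model (hypothetical): the same step with the multiply
 * strength-reduced to a shift and an add. */
uint32_t impl_step(uint32_t acc, uint32_t x) {
    return acc + (x << 1) + x;
}

/* Advance both models in lock step, comparing their state after every
 * cycle. Returns the first cycle at which they differ, or -1 if they
 * agree on all `cycles` inputs. */
long lockstep_compare(const uint32_t *inputs, size_t cycles) {
    uint32_t spec_acc = 0;
    uint32_t impl_acc = 0;
    for (size_t c = 0; c < cycles; c++) {
        spec_acc = spec_step(spec_acc, inputs[c]);
        impl_acc = impl_step(impl_acc, inputs[c]);
        if (spec_acc != impl_acc) {
            return (long)c;
        }
    }
    return -1;
}
```

Agreement on a finite set of concrete inputs does not establish equivalence; a symbolic simulation would replace each concrete input with a symbolic value so that one run covers all inputs.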
Note that certification with our approach assumes that the high-level
transformations performed by behavioral synthesis before pipelining are correct. Such
transformations are typically compiler transformations (e.g., loop unrolling, code
motion, dead code elimination, etc.) and scheduling. Certainly many of these
transformations are complex, and a complete certification of the synthesis flow
requires certification of these high-level transformations as well. We do not address
certification of these high-level transformations in this dissertation for several
reasons. First, these transformations are generic compiler transformations. For instance,
AutoESL, a widely used commercial synthesis tool, and LegUp, an academic
synthesis tool, make use of the open-source LLVM compiler transformations [50]. As
such, they are more trusted than the low-level transformations, which often involve
manual tweaks to squeeze out extra efficiency. Second, these transformations are
already being studied elsewhere in the context of compiler verification [48, 67, 74].
Finally, related efforts at the University of Texas and Portland State University
are focused on the use of theorem proving techniques for certification of many
of the necessary high-level transformations. When combined with their certified
transformations, our framework can be used to certify an entire synthesis flow.
1.3 RELATED WORK
Several early approaches have been proposed to verify the correctness of the
pioneering behavioral synthesis tools. An early effort on verification of high-level
synthesis targeting the behavioral portion of VHDL was proposed by Chapman [14],
which aimed to verify parts of a high-level synthesis system by giving semantics to
the representation languages used. A translation from behavioral VHDL to
dependence flow graphs [36] was verified by structural induction based on the CSP [32]
semantics. There has been research on certified synthesis of hardware from formal
languages such as HOL [28], in which a compiler that automatically translates
recursive function definitions in HOL to clocked synchronous hardware has been
developed. A certified hardware synthesis from programs in Esterel (a synchronous
design language) has also been developed [61], in which a variant of Esterel was
embedded in HOL to enable formal reasoning.
There has been much research on sequential equivalence checking between RTL
and gate-level hardware designs [4, 38]. Research has also been done on
combinational equivalence checking between high-level designs in software-like languages
(e.g., SystemC) and RTL designs [33]. There has also been an effort on SEC between
software specifications and hardware implementations [21]: GSTE assertion graphs
[73] were extended so that an assertion graph edge has pre- and post-condition
labels, as well as associated assignments that update state variables. There has also
been work on equivalence checking with other graph representations, e.g., the Signal
Flow Graph [17].
In recent years, promising progress on equivalence verification between
system-level models and RTL has been made in both academia and industry [53]. Kundu
et al. [46] present an approach to validating the result of behavioral synthesis
using insights from translation validation, automated theorem proving, and relational
approaches to reasoning about programs. This approach targets compiler
transformations, so it cannot verify the correctness of scheduling, binding, and finite
state machine (FSM) generation. Clarke et al. [18] proposed an algorithm that
checks behavioral consistency between an ANSI-C program and a circuit given in
Verilog using Bounded Model Checking. Both the circuit and the program are
unwound and translated into a formula that represents behavioral consistency. The
formula is then checked using a SAT solver. Kroening [44] has further enhanced
this algorithm by using predicate abstraction and induction. These approaches
aim to check whether the RTL satisfies the same properties as the corresponding
ANSI-C program, rather than to establish equivalence. The Sequential Logic
Equivalence Checker (SLEC) from Calypto [12] can verify RTL implementations using
system models written in C/C++ or SystemC, without requiring additional testbenches
or assertions. SLEC reduces the SEC problem to one over cycle-accurate designs
derived from the original designs, on which standard equivalence checking
techniques can then be deployed [15]. Hector [42] from Synopsys is a formal
equivalence checking framework that addresses the system-level to RTL formal
verification problem and integrates multiple bit-level and word-level equivalence
checking techniques. It employs an efficient formal model constructed from high-level
descriptions using symbolic simulation [43]. Several optimizations can be applied to
minimize the size of the formal model and reduce the complexity. Furthermore, an
approach has been proposed to handle memory interfaces by using memory
mappings provided by the user as invariants for an induction proof [41]. However, these
two industrial tools do not have published benchmarks for comparison.
There is a significant literature on verifying pipelined microprocessors [10, 37,
49, 69], which has parallels with our work. Compared with pipelines in
microprocessors, function pipelines generated by behavioral synthesis are more
challenging to certify because (1) the pipelines can be very deep and (2) each
pipeline stage can be quite complex. There has been very little
published work on formal verification of pipelines generated by behavioral
synthesis. Nevertheless, any viable SEC framework for behavioral synthesis must handle
pipelining transformations. To our knowledge, current implementations either
involve cost-prohibitive input-output comparison or require the user to provide the
requisite mappings.
1.4 DISSERTATION OUTLINE
The rest of this dissertation is organized as follows. In Chapter 2, we give a
brief overview of background, including behavioral synthesis flows and verification
technologies. In Chapter 3, we present our intermediate representation and
discuss in detail our approach to equivalence checking based on word-level symbolic
simulation. In Chapter 4, we discuss three optimizations targeting different
design features. We illustrate our approach to equivalence checking for behaviorally
synthesized loop pipelines and function pipelines in Chapter 5 and Chapter 6,
respectively. In Chapter 7, we summarize the contributions and discuss future work.
Chapter 2
BACKGROUND
2.1 BEHAVIORAL SYNTHESIS
With the rapid increase of complexity in System-on-Chip (SoC) designs, the
Electronic Design Automation (EDA) community is becoming more interested in
designing hardware with a behavioral-level model rather than an RTL description.
This, together with the increased use of high-level languages in behavioral modeling,
has led to a renewed interest in behavioral synthesis, both in industry and in
academia [70].
A behavioral synthesis tool accepts a behavioral description, together with a
library of hardware resources; it performs a sequence of transformations on the
description to generate an RTL implementation. The transformations can be roughly
partitioned into the following three phases.
• The first phase involves compiler transformations. These include loop
  unrolling, common subexpression elimination, etc. Furthermore, expensive
  operations (e.g., division) are often replaced with simpler ones (e.g.,
  subtraction).
• The second phase is scheduling, which determines the clock step for each
  operation. The ordering between operations is constrained by the data and
  control dependencies. Scheduling transformations include chaining operations
  across conditional blocks and decomposing one operation into a sequence of
  multi-cycle operations based on resource constraints.
• The third phase is resource binding and control synthesis, which binds
  operations to functional units, allocates and binds registers, and generates the
  control circuit to implement the schedule.
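The scheduling phase can be illustrated with a minimal as-soon-as-possible (ASAP) scheduler, sketched below in C. ASAP is a standard textbook heuristic chosen here only for illustration; the source does not say which scheduling algorithm any particular tool uses. The `Op` structure, `MAX_PREDS`, and `asap_schedule` are hypothetical names, every operation is assumed to take one clock step, and resource constraints are ignored.

```c
#include <stddef.h>

#define MAX_PREDS 4  /* illustrative bound on data dependencies per operation */

/* One operation in a data-flow graph (hypothetical representation).
 * preds[] holds the indices of the operations whose results it consumes. */
typedef struct {
    int num_preds;
    int preds[MAX_PREDS];
} Op;

/* ASAP scheduling: place each operation one clock step after its latest
 * predecessor. Assumes ops[] is listed in topological order (every
 * predecessor index is smaller than the operation's own index) and that
 * each operation takes exactly one step. */
void asap_schedule(const Op *ops, size_t n, int *step) {
    for (size_t i = 0; i < n; i++) {
        step[i] = 0;
        for (int p = 0; p < ops[i].num_preds; p++) {
            int earliest = step[ops[i].preds[p]] + 1;
            if (earliest > step[i]) {
                step[i] = earliest;
            }
        }
    }
}
```

For a graph in which operations 0 and 1 have no predecessors, operation 2 consumes both, and operation 3 consumes operation 2, the computed steps are 0, 0, 1, and 2. The data dependencies alone fix this ordering; a real scheduler would additionally account for resource limits, operation latencies, and chaining.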
After these transformations, the RTL implementation is generated, which is
subjected to further manual optimizations to fine-tune for performance and power.
Each synthesis transformation is non-trivial. The result of their composition
is a hardware implementation with a large semantic distance from its input
description. As an example, consider the synthesis of the Tiny Encryption Algorithm
(TEA) [71]. Figure 2.1 shows a high-level C specification and the circuit
synthesized by AutoESL. TEA, of course, is only a pedagogical algorithm and, indeed,
rather weak in cryptographic strength; nevertheless, the example highlights some
transformations involved in behavioral synthesis. The following transformations
are involved in synthesis of the circuit.
• In the first phase, constant propagation removes unnecessary variables and
  operations (see footnote 1). For instance, the variable delta is replaced by its
  constant value.
• In the second phase, a key scheduling transformation performed is pipelining,
  to enable overlapping execution of operations from different loop iterations.
• In the third phase, operations are bound to hardware resources (e.g., the
  "+" operation to a hardware adder); furthermore, a finite-state machine is
  generated to control circuit operations.
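The first-phase transformation on TEA can be illustrated by placing the specification next to a hand-written version in which the constant has been propagated. This is a sketch of what constant propagation does to the source, not the output of any synthesis tool; the names `encrypt_spec`, `encrypt_folded`, and `same_output` are hypothetical.

```c
#include <stdint.h>

/* TEA encryption as in the specification, with delta held in a variable. */
void encrypt_spec(uint32_t *v, const uint32_t *k) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0, i;
    uint32_t delta = 0x9e3779b9;
    for (i = 0; i < 32; i++) {
        sum += delta;
        v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
    }
    v[0] = v0;
    v[1] = v1;
}

/* The same function after constant propagation (written by hand for
 * illustration): the variable delta is gone and its value is used directly. */
void encrypt_folded(uint32_t *v, const uint32_t *k) {
    uint32_t v0 = v[0], v1 = v[1], sum = 0, i;
    for (i = 0; i < 32; i++) {
        sum += 0x9e3779b9u;
        v0 += ((v1 << 4) + k[0]) ^ (v1 + sum) ^ ((v1 >> 5) + k[1]);
        v1 += ((v0 << 4) + k[2]) ^ (v0 + sum) ^ ((v0 >> 5) + k[3]);
    }
    v[0] = v0;
    v[1] = v1;
}

/* Run both versions on one input and report whether the outputs agree.
 * Returns 1 on agreement and 0 otherwise. */
int same_output(const uint32_t v_in[2], const uint32_t k[4]) {
    uint32_t a[2] = { v_in[0], v_in[1] };
    uint32_t b[2] = { v_in[0], v_in[1] };
    encrypt_spec(a, k);
    encrypt_folded(b, k);
    return a[0] == b[0] && a[1] == b[1];
}
```

Comparing outputs on sample inputs, as `same_output` does, is only a sanity check. Establishing that a transformation preserves behavior for all inputs is the job of formal techniques such as the equivalence checking studied in this dissertation.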
Each synthesis transformation must respect a number of implicit design
invariants. For instance, parallelizing operations across different loop iterations
must avoid race conditions, and scheduling must respect underlying data dependencies.
Since such considerations are entangled with low-level optimization heuristics, it is
easy to have errors in the synthesis tool itself, leading to synthesis of buggy
designs. On the other hand, the semantic distance pointed to above makes direct
comparison of executions of the synthesized RTL and its input description very
challenging, if not infeasible. Indeed, attempts to perform such comparison through
sequential equivalence checking require full, cost-prohibitive symbolic co-simulation
between the C and the RTL to check their input/output correspondence [33].

Footnote 1: Another compiler transformation that could be performed is loop
unrolling. We avoided it for presentation simplicity.

    #include <stdint.h>

    void encrypt(uint32_t *v, uint32_t *k) {
        uint32_t v0 = v[0], v1 = v[1], sum = 0, i;
        uint32_t delta = 0x9e3779b9;
        uint32_t k0 = k[0], k1 = k[1], k2 = k[2], k3 = k[3];
        for (i = 0; i < 32; i++) {
            sum += delta;
            v0 += ((v1 << 4) + k0) ^ (v1 + sum) ^ ((v1 >> 5) + k1);
            v1 += ((v0 << 4) + k2) ^ (v0 + sum) ^ ((v0 >> 5) + k3);
        }
        v[0] = v0; v[1] = v1;
    }

(A) C code for TEA

[Schematic not reproduced. Recoverable labels: Phi nodes for V0_0, V1_0, and sum0;
the constant 0x9e3779b9; a loop counter compared against 32; key inputs k0 to k3;
ports V[0] and V[1]; an FSM; pipeline logic; and three combinational blocks
computing out = ((i0<<4)+i1)^i2, out = (i0+i1)^((i0>>5)+i2), and
out = ((i0<<4)+i2)^(i0+i1)^((i0>>5)+i3).]

(B) Schema of RTL
Figure 2.1: Input and output of behavioral synthesis
2.2 SOLVER TECHNOLOGY
2.2.1 Binary Decision Diagram
Binary Decision Diagrams (BDDs) have been widely employed in many applications.
Though BDDs are not new [47], Bryant's pioneering work renewed the
interest of many researchers [8]. He proposed Reduced Ordered Binary Decision
Diagrams (ROBDDs, or OBDDs for short). The key insight is that, for a fixed
variable ordering, reduced and ordered binary decision diagrams are a canonical
representation of Boolean functions. Canonicity reduces the semantic notion of
equivalence to the syntactic notion of isomorphism. Thus, checking the equivalence
of two Boolean formulas reduces to comparing their BDDs, which takes constant time
when both BDDs share one node table (the check is then a pointer comparison).
A Boolean function can be represented as a rooted, directed acyclic graph;
before any reduction, this graph is a decision tree. Figure 2.2 (b) illustrates such
a representation of the function $f(x_1, x_2, x_3)$ defined by the truth table given
in Figure 2.2 (a). The variable ordering is given as $x_1 < x_2 < x_3$. Each
nonterminal vertex v has arcs directed toward two children: lo(v) (shown as a dashed
line), corresponding to the case where the variable labeling v is assigned 0, and
hi(v) (shown as a solid line), corresponding to the case where it is assigned 1.
Each terminal vertex is labeled 0 or 1. For a given assignment to the input
variables of f, the value of f is determined by following the path from the root to
a terminal vertex along the branches selected by the values of the variables at the
nonterminal vertices.
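(a) Truth table

x1  x2  x3 | f
 0   0   0 | 0
 0   0   1 | 0
 0   1   0 | 0
 0   1   1 | 1
 1   0   0 | 0
 1   0   1 | 1
 1   1   0 | 0
 1   1   1 | 1

(b) Decision tree [not reproduced: root x1, two x2 vertices, four x3 vertices, and
terminal values 0 0 0 1 0 1 0 1 from left to right]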
Figure 2.2: Decision tree representation
A decision tree can be reduced to a BDD by applying the following three
transformations: (1) Remove duplicate terminals. This transformation eliminates all
but one terminal vertex with a given label and redirects all arcs into the
eliminated vertices to the remaining one (shown in Figure 2.3 (a)). (2) Remove
duplicate nonterminals. In this step, if nonterminal vertices $u$ and $v$ test the
same variable and have $lo(u) = lo(v)$ and $hi(u) = hi(v)$, then one of the two
vertices can be eliminated. All incoming arcs to the eliminated one are redirected
to the other vertex (shown in Figure 2.3 (b)). (3) Remove redundant tests. If
nonterminal vertex $v$ has $lo(v) = hi(v)$, then this vertex can be eliminated. All
incoming arcs to $v$ are redirected to its child (shown in Figure 2.3 (c)).
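
The three transformations can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `build_robdd`, `mk`, and `evaluate` are hypothetical, and the sketch builds the reduced graph bottom-up with a table of unique nodes instead of first drawing the full tree and then rewriting it.

```python
def build_robdd(f, n):
    """Build a reduced ordered BDD for f under the ordering x1 < ... < xn.

    Terminals are the integers 0 and 1; every other node id maps to a
    (variable index, lo child, hi child) triple in the returned table.
    """
    unique = {}   # (var, lo, hi) -> node id; sharing removes duplicate nonterminals
    nodes = {}    # node id -> (var, lo, hi)

    def mk(var, lo, hi):
        if lo == hi:                  # transformation (3): redundant test
            return lo
        key = (var, lo, hi)
        if key not in unique:         # transformation (2): reuse an identical vertex
            unique[key] = len(unique) + 2
            nodes[unique[key]] = key
        return unique[key]

    def build(level, prefix):
        if level == n:                # transformation (1): only two terminals exist
            return int(f(*prefix))
        lo = build(level + 1, prefix + (0,))
        hi = build(level + 1, prefix + (1,))
        return mk(level + 1, lo, hi)

    return build(0, ()), nodes


def evaluate(root, nodes, assignment):
    """Follow lo/hi arcs from the root until a terminal is reached."""
    node = root
    while node > 1:
        var, lo, hi = nodes[node]
        node = hi if assignment[var - 1] else lo
    return node


# Truth table of Figure 2.2 (a), keyed by (x1, x2, x3).
TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 0, (0, 1, 1): 1,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 0, (1, 1, 1): 1,
}
root, nodes = build_robdd(lambda x1, x2, x3: TABLE[(x1, x2, x3)], 3)
print(len(nodes), "nonterminal vertices")
```

For the function of Figure 2.2 the reduced graph keeps one vertex per variable, which agrees with the shape of Figure 2.3 (c).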

Figure 2.3: BDD transformations (panels (a), (b), and (c) show the result of
transformations (1), (2), and (3), respectively)

BDDs have proven to be a successful representation for model checking in
many practical applications. However, one limitation of BDD-based approaches is
that the size of a BDD depends heavily on the variable ordering. For instance,
given the Boolean expression $a_1 \cdot b_1 + a_2 \cdot b_2 + \cdots + a_n \cdot b_n$,
ordering the variables as $a_1 < b_1 < \cdots < a_n < b_n$ yields a BDD with $2n$
nonterminal vertices. On the other hand, ordering the variables as
$a_1 < \cdots < a_n < b_1 < \cdots < b_n$ yields a BDD with $2(2^n - 1)$
nonterminal vertices. For large values of $n$, the exponential growth of the
second ordering has a dramatic effect on runtime and memory usage compared
with the linear growth of the first. Unfortunately, finding the best variable
ordering is an NP-complete problem. In practice, the ordering is chosen either
manually or by a heuristic analysis of the particular system to be represented
[24, 51, 35].
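
The two vertex counts can be reproduced for small $n$ with a short Python sketch. It relies on the fact that the nonterminal vertices of a reduced ordered BDD correspond to the distinct subfunctions that depend on their first remaining variable. The helper names (`robdd_size`, `sum_of_products`, `interleaved`, `grouped`) are hypothetical, and the explicit truth table limits the sketch to small examples.

```python
from itertools import product


def robdd_size(f, order):
    """Count the nonterminal vertices of the ROBDD of f under `order`."""
    table = tuple(
        int(f(dict(zip(order, bits))))
        for bits in product((0, 1), repeat=len(order))
    )
    seen = set()

    def visit(sub):
        if len(sub) == 1:              # terminal vertex
            return
        half = len(sub) // 2
        lo, hi = sub[:half], sub[half:]
        if lo == hi:                   # redundant test: no vertex is created
            visit(lo)
            return
        if sub in seen:                # subfunction already has a vertex
            return
        seen.add(sub)
        visit(lo)
        visit(hi)

    visit(table)
    return len(seen)


def sum_of_products(n):
    """a1*b1 + a2*b2 + ... + an*bn as a function of a variable environment."""
    return lambda env: any(env[f"a{i}"] and env[f"b{i}"] for i in range(1, n + 1))


def interleaved(n):
    return [v for i in range(1, n + 1) for v in (f"a{i}", f"b{i}")]


def grouped(n):
    return [f"a{i}" for i in range(1, n + 1)] + [f"b{i}" for i in range(1, n + 1)]


for n in range(1, 5):
    f = sum_of_products(n)
    print(n, robdd_size(f, interleaved(n)), robdd_size(f, grouped(n)))
```

The first count grows as $2n$ and the second as $2(2^n - 1)$, matching the formulas in the text.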
2.2.2 Boolean Satisfiability Problem
Boolean Satisfiability (SAT) is the decision problem of determining whether there
exists an assignment of values to the variables of a propositional logic formula
that satisfies the formula. SAT is the first known example of an NP-complete
decision problem [19], which means that, unless P = NP, no SAT algorithm runs in
worst-case polynomial time. Modern SAT algorithms nevertheless deal effectively
with large search spaces by exploiting the structure of the problem [65, 55, 27].
SAT techniques are widely used in a number of areas, such as combinational
equivalence checking [7], model checking [6, 63], automatic test pattern
generation (ATPG) [16], FPGA routing [56], and planning [40].

Currently, most state-of-the-art SAT solvers require the propositional formulas
to be represented in Conjunctive Normal Form (CNF) as defined in Definition 3
[34]. A CNF formula may be viewed as a set of clauses and a clause may be
viewed as a set of literals.
Denition 1 Literal. A literal is either a variable p or its negation :p. The rst
case is called a positive literal; the second is called a negative literal.
Denition 2 Clause. A clause is a nite disjunction of literals, e.g., l1_ l2_ l3 : : :,
where li is a literal.
Denition 3 Conjunctive Normal Form. A propositional formula is in Con-
junctive Normal Form (CNF) if it is a nite conjunction of clauses, e.g., C1 ^C2 ^
C3 : : :, where Ci is a clause.
An assignment $A$ for a set of variables $X$ is a function $A : X \to \{0, u, 1\}$,
where $u$ is ordered between 0 and 1 and marks a variable that has not been
assigned. Here, 0 and 1 represent false and true, respectively. Given an
assignment, clauses and CNF formulas can be characterized as unsatisfied,
satisfied, or unresolved [64]. The SAT problem for a CNF formula $\varphi$ consists
in deciding whether there exists an assignment to the problem variables such that
$\varphi$ is satisfied, or proving that no such assignment exists. An assignment
that satisfies a formula $\varphi$ is called a satisfying assignment.
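
The three-way characterization of clauses and formulas can be made concrete with a small Python sketch. The encoding is an assumption chosen for brevity: literals are nonzero integers (`v` for a variable, `-v` for its negation), and a variable missing from the assignment plays the role of the value $u$. The function names are hypothetical.

```python
def literal_value(lit, assignment):
    """True/False value of a literal, or None if its variable is unassigned."""
    value = assignment.get(abs(lit))
    if value is None:
        return None
    return value if lit > 0 else not value


def clause_status(clause, assignment):
    values = [literal_value(lit, assignment) for lit in clause]
    if any(v is True for v in values):
        return "satisfied"            # one true literal suffices
    if all(v is False for v in values):
        return "unsatisfied"          # every literal is false
    return "unresolved"               # undecided under this partial assignment


def formula_status(cnf, assignment):
    statuses = [clause_status(clause, assignment) for clause in cnf]
    if "unsatisfied" in statuses:
        return "unsatisfied"
    if all(s == "satisfied" for s in statuses):
        return "satisfied"
    return "unresolved"


# (p1 or not p2) and (p2 or p3)
cnf = [[1, -2], [2, 3]]
print(formula_status(cnf, {1: True}))
print(formula_status(cnf, {1: False, 2: True}))
print(formula_status(cnf, {1: True, 3: True}))
```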
A combinational circuit can be translated into an intermediate representation,
which can then be used to generate CNF formulas. Combinational Boolean
circuits [6] are one of the most widely accepted intermediate representations.
Combinational Boolean circuits are composed of gates and connections between
gates. In a combinational Boolean circuit, the notation $y = Op(x_1, x_2)$ denotes
a gate which has two inputs $x_1$ and $x_2$ and a single output $y$, where $Op$ is
one of the basic logic operations, such as AND, OR, etc. Converting Boolean
circuits to CNF is straightforward, and follows the procedure first outlined by
G. Tseitin [68].
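
As an illustration of the conversion, the following Python sketch emits the standard Tseitin clauses for a single gate $y = Op(x_1, x_2)$ and confirms by enumeration that they hold exactly when the gate equation holds. The integer literal encoding and the function names are assumptions made for the example.

```python
from itertools import product


def tseitin_gate(op, y, x1, x2):
    """Clauses that hold exactly when y = Op(x1, x2); variables are positive ints."""
    if op == "AND":
        return [[-y, x1], [-y, x2], [y, -x1, -x2]]
    if op == "OR":
        return [[y, -x1], [y, -x2], [-y, x1, x2]]
    raise ValueError(f"unsupported gate: {op}")


def satisfies(cnf, assignment):
    """assignment maps each variable to True/False; every clause needs a true literal."""
    return all(any(assignment[abs(l)] == (l > 0) for l in clause) for clause in cnf)


def encodes_gate(op, gate_fn):
    """Check the encoding against the gate's truth table."""
    cnf = tseitin_gate(op, 3, 1, 2)
    return all(
        satisfies(cnf, {1: a, 2: b, 3: y}) == (y == gate_fn(a, b))
        for a, b, y in product((False, True), repeat=3)
    )


print(encodes_gate("AND", lambda a, b: a and b))
print(encodes_gate("OR", lambda a, b: a or b))
```

A whole circuit is encoded by taking the union of the clauses of all its gates, with one fresh variable per internal wire.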
2.2.3 Satisfiability Modulo Theories
Although SAT solvers have achieved success in many practical applications, some
applications require greater modeling flexibility than plain SAT; for instance, a
theory of arrays of integers is more effective in modeling the memory usage of a
program. On the other hand, general-purpose first-order theorem provers are
typically not able to solve such formulas directly. The main reason for this is that
many applications require not general first-order satisfiability, but rather
satisfiability with respect to some background theory, which fixes the
interpretations of certain predicate and function symbols [2].
Satisfiability Modulo Theories (SMT) is the problem of deciding the
satisfiability of a first-order formula with respect to some decidable first-order
theory $T$. It requires deciding the satisfiability of formulae which are Boolean
combinations of atomic Boolean propositions and atomic propositions in $T$, so that
Boolean reasoning is carried out by powerful SAT solvers while reasoning in the
theory $T$ is carried out by efficient theory-specific decision procedures.
Most SMT solvers use the Nelson-Oppen method [57], which combines decision
procedures for different decidable theories, under certain conditions, into a
decision procedure for their combination. These solvers support the theories of
integers, reals, lists, arrays, bit vectors, etc., which allows us to model
hardware circuits at the word level rather than the bit level. SMT solvers also
support uninterpreted functions. For instance, an SMT solver can determine that
$f(x) = f(y)$ holds whenever $x = y$. This allows us to model the hierarchies in
hardware circuits by modeling a lower-level circuit as an uninterpreted function.
Currently there are two main approaches to SMT solving: eager and lazy [42].
Eager SMT solvers first try to solve the word-level problem by employing
preprocessing, rewrites and abstraction. If the rewrites are not sufficient, the
word-level formula is translated into a bit-level formula and a SAT solver is used
to determine its satisfiability. One of the advantages of eager SMT solvers is that
they can directly leverage any efficient SAT solver. Some eager SMT solvers are
BAT [52] and STP [26]. In contrast, lazy SMT solvers integrate theory-specific
procedures for the background theory with a SAT solver. A given formula $\phi$ is
abstracted to a Boolean formula $\phi_b$. The abstraction is generated by replacing
all atomic theory predicates in $\phi$ by Boolean variables, so that each Boolean
variable of $\phi_b$ stands for a sub-formula of $\phi$ in the background theory.
The satisfiability of $\phi_b$ is checked using a SAT solver. If $\phi_b$ is
unsatisfiable, then $\phi$ is also unsatisfiable. If $\phi_b$ is satisfiable, the
satisfying assignment is checked for consistency by the theory-specific
procedures; an inconsistent assignment is ruled out and the search continues.
Some lazy SMT solvers are Yices [20] and CVC3 [3].
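
The lazy scheme can be sketched on a toy theory of equality between named constants. Everything here is an illustrative assumption and not how CVC3 or Yices are implemented: brute-force enumeration stands in for the SAT solver, a union-find check stands in for the theory procedure, and the function names are hypothetical.

```python
from itertools import product


def theory_consistent(atoms, model):
    """Check an assignment to equality atoms (x, y) with a union-find structure."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            v = parent[v]
        return v

    for (x, y), value in zip(atoms, model):
        if value:                                  # atom assigned true: merge classes
            parent[find(x)] = find(y)
    # Atoms assigned false must relate constants in different classes.
    return all(find(x) != find(y) for (x, y), value in zip(atoms, model) if not value)


def lazy_smt(atoms, cnf):
    """cnf is the Boolean abstraction: literal k (or -k) refers to atoms[k - 1]."""
    for model in product((False, True), repeat=len(atoms)):
        boolean_ok = all(
            any(model[abs(l) - 1] == (l > 0) for l in clause) for clause in cnf
        )
        if boolean_ok and theory_consistent(atoms, model):
            return model                           # satisfiable modulo the theory
    return None                                    # no Boolean model survives


atoms = [("x", "y"), ("y", "z"), ("x", "z")]
# x = y and y = z and not (x = z): the abstraction is satisfiable, the formula is not.
print(lazy_smt(atoms, [[1], [2], [-3]]))
# x = y and y = z: satisfiable, and the theory forces x = z to be true.
print(lazy_smt(atoms, [[1], [2]]))
```

Practical lazy solvers do not enumerate models; they typically add a clause that blocks each theory-inconsistent assignment and resume the SAT search.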
In this research, we employ CVC3, one of the most successful SMT solvers,
which is being developed at New York University and the University of Iowa. CVC3
provides several different user interfaces including high-level APIs for both C and
C++, an interactive command-driven interface, and a file interface.
2.3 SCALABLE VERIFICATION TECHNIQUES
2.3.1 Symbolic Simulation
Simulation is the most common method for testing and debugging hardware designs.
But the problem is that one simulation run can only validate one test case. To
fully verify a hardware design, engineers must exhaustively simulate the entire set
of test cases to explore the whole state space, which is extremely time-consuming.
Symbolic simulation allows us to compute information on the entire set of values
in a single simulation run, because the set of test vectors is encoded
symbolically, instead of using a specific element of the set [9]. This approach
dramatically improves the efficiency of design validation.
Consider a 2-bit AND operator which has two 2-bit inputs $A$ and $B$ and a 1-bit
output $C$. In order to fully verify this operator, a conventional simulator must
try all 16 possible test vectors. But for symbolic simulation, we treat the inputs
as two 2-bit symbols $A$ and $B$, and a single simulation run produces the output
$C = A \wedge B$.
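
The contrast between the two styles of simulation can be sketched in Python. The sketch assumes the operator is a bitwise AND that returns one result bit per input bit, which is an interpretation made for the example; the `Sym` class and `and2` function are hypothetical names.

```python
from itertools import product


class Sym:
    """A symbolic bit that records the expression computed from it."""

    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        return Sym(f"({self.expr} & {other.expr})")


def and2(a, b):
    """Operator under test: bitwise AND of two 2-bit values given as (bit1, bit0)."""
    return tuple(x & y for x, y in zip(a, b))


# Conventional simulation: one run per test vector, 2^4 = 16 runs in total.
concrete_runs = {
    (a, b): and2(a, b)
    for a in product((0, 1), repeat=2)
    for b in product((0, 1), repeat=2)
}

# Symbolic simulation: a single run over symbols covers every vector.
A = (Sym("A1"), Sym("A0"))
B = (Sym("B1"), Sym("B0"))
symbolic_run = [bit.expr for bit in and2(A, B)]

print(len(concrete_runs), "concrete runs")
print(symbolic_run)
```

The same `and2` code runs in both modes; only the type of the values flowing through it changes.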
However, symbolic simulation also has two main bottlenecks when applied to
large designs [33]. First, because symbolic simulation enumerates all possible
execution paths, the number of paths to be explored may grow exponentially.
Second, the terms representing the symbolic values of variables may also blow up
exponentially. Moreover, symbolic exploration of loops may lead to long
executions, which may cause further blow-up. Although modern SMT solvers are able
to handle such blow-ups to a certain extent, the performance is reduced
significantly [66].
2.3.2 Equivalence Checking for Logic Synthesis
Our research leverages the success of equivalence checking for logic synthesis [33,
5, 7], and employs these ideas for equivalence checking for behavioral synthesis.
Next, we provide an overview of the equivalence checking concepts.
Equivalence checking between RTL descriptions and gate-level implementations
of combinational circuits is a mature field with decades of research [38, 4]. To
check whether two combinational circuits are functionally equivalent, we need to
prove that, for all possible inputs, both combinational circuits have the same
outputs. Hardware circuits are modeled as Boolean expressions, so the problem of
checking whether two circuits are equivalent is converted to the problem of
determining whether two Boolean expressions are equivalent.
Consider the simple example in Figure 2.4. We can use symbolic simulation
to compute the relationship between inputs and outputs. Given the two circuits
and the same input symbols, the outputs are symbolic expressions in terms of the
inputs. For the left-hand circuit in Figure 2.4, the final output $f$ combines $a$
with $(b \wedge c) \wedge d$. Similarly, the output of the right-hand circuit
combines $a$ with $b \wedge (c \wedge d)$ through the same output gate. To verify
the equivalence of these two circuits, we just need to verify whether these two
expressions are equivalent.

Figure 2.4: Simple cut-point example

As discussed above, one bottleneck of symbolic simulation is the exponential
blow-up of the expression lengths. The major practical reduction technique for
combinational equivalence checking is cut-points [5, 7]. The main idea is to look
for corresponding points in the two circuits that can be proven to be equivalent;
the equivalent sub-circuits can then be cut out of the circuits and replaced by new
primary symbols. For instance, in Figure 2.4, to introduce cut-point $x$, we first
verify that $(b \wedge c) \wedge d$ is equivalent to $b \wedge (c \wedge d)$. Then
we cut the sub-circuits off and introduce a new symbol $x$ to represent the
equivalent circuits. Then we can verify that the two outputs are equivalent
because both combine $a$ with $x$ in the same way. Therefore, the complexity of
verification is reduced. In general, the method is conservative: if the proof
fails, we cannot conclude that the two circuits are inequivalent. The reason is
that when we introduce new symbols for cut-points, we may lose constraints. The
situation in which two circuits are equivalent but the equivalence checker reports
inequivalence is called a false negative. In general, the solution to this problem
is to re-introduce constraints on the cut-points [33, 7].
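
A false negative and its repair can be reproduced on a small example that is not taken from Figure 2.4. It is a hypothetical pair of circuits chosen because two cut-points in it are related by a constraint. Exhaustive enumeration stands in for the equivalence checker.

```python
from itertools import product


def equivalent(f, g, n, constraint=lambda *v: True):
    """Compare f and g on every n-bit input that satisfies the constraint."""
    return all(
        f(*v) == g(*v)
        for v in product((False, True), repeat=n)
        if constraint(*v)
    )


# Hypothetical circuits over inputs b and c.
def left(b, c):
    return (b and c) and (b or c)


def right(b, c):
    return b and c


# Cut-points: x replaces (b and c), y replaces (b or c) in both circuits.
def left_cut(x, y):
    return x and y


def right_cut(x, y):
    return x


full_check = equivalent(left, right, 2)
naive_cut_check = equivalent(left_cut, right_cut, 2)
constrained_check = equivalent(
    left_cut, right_cut, 2,
    constraint=lambda x, y: (not x) or y,   # re-introduced fact: x implies y
)
print(full_check, naive_cut_check, constrained_check)
```

With free symbols the checker explores the combination $x = 1$, $y = 0$, which the original sub-circuits can never produce; adding the implication back removes that spurious case.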
Chapter 3
EQUIVALENCE CHECKING
In this chapter, we present a graph-based design representation, called Clocked
Control/Data Flow Graph (CCDFG), as our intermediate representation. Our
equivalence checking between a CCDFG and its synthesized RTL implementation
is based on dual-rail symbolic simulation. The checking approach has been
implemented to be fully automatic.
3.1 CLOCKED CONTROL/DATA FLOW GRAPHS
A CCDFG can be viewed as a formal control/data flow graph (CDFG), which is used
as the internal representation in most synthesis tools, augmented with a schedule.
The semantics of CCDFG are formalized in the logic of the ACL2 theorem prover [39].
Figure 3.1 shows two CCDFGs for the TEA encryption function: an initial CCDFG
derived from the C code, and its successive transformation after pipelining. This
section briefly discusses the formulation of a CCDFG; for a more complete account,
see [59].
The formalization of CCDFG assumes that the underlying language provides
the semantics for primitive operations (e.g., arithmetic operations, comparison,
etc.). The key components of the formalization are (1) control and data flow
graphs, (2) microstep partition, and (3) schedule. Following standard conventions,
the control flow is broken up into basic blocks; correspondingly, data
dependencies follow the "read after write" paradigm: $op_j$ is dependent on $op_i$
if $op_j$ occurs after $op_i$ in a control path and computes an expression over
some variable $v$ that is assigned most recently by $op_i$ in the path. A microstep
partition is a partitioning of operations in a basic block such that if $op_i$ and
$op_j$ are in the same partition then their execution order is irrelevant to
control and data dependencies. Each component of a microstep partition is a
microstep. A schedule is a grouping of microsteps; informally, if $m_0$ and $m_1$
belong to the same scheduling step then they are executed within the same clock
cycle. A CCDFG execution is formalized through state-based semantics. A CCDFG
state (resp., CCDFG input) is a valuation of the state (resp., input) variables.
Given a sequence of inputs, an execution of a CCDFG $G$ with microstep partition
$M$ and schedule $T$ is a sequence of CCDFG states that corresponds to an
evaluation of the microsteps of $M$ respecting $T$.

Figure 3.1: CCDFGs for the TEA encryption function. (A) Initial CCDFG of TEA;
(B) CCDFG after pipelining.

Remark (Conventions). For a given CCDFG $G \triangleq \langle G_{CD}, M, T \rangle$
and a set $t \in T$, we use the term "projection of $G$ on $t$" to mean the CCDFG
$G_t \triangleq \langle G'_{CD}, M', \{t\} \rangle$, where $G'_{CD}$ and $M'$
contain only the operations in $G_{CD}$ and $M$, respectively, that are members of
$t$. For a set $T_0 \subseteq T$, we use "projection of $G$ on $T_0$" to denote the
following graph $G'$. The nodes of $G'$ are given by the set
$N \triangleq \{G_t : t \in T_0\}$; given $g_0, g_1 \in N$, there is an edge from
$g_0$ to $g_1$ if there are operations $o_1$ and $o_2$ such that $o_1 \in g_0$,
$o_2 \in g_1$, and there is an edge from $o_1$ to $o_2$ in $G_{CD}$.

Since a schedule is a partition of microsteps, $T_0$ induces a partition of
$G_{CD}$ such that if $t_0 \neq t_1$ the partition induced by $t_0$ is disjoint
from that induced by $t_1$. Given a set $T$ of scheduling steps, one can describe
the CCDFG $G \triangleq \langle G_{CD}, M, T \rangle$ uniquely as the triple
$\langle S, E, M \rangle$ where $S$ and $E$ denote the nodes and edges of the
projection of $G$ on $T$, and $M$ is the set of microstep partitions refined by
$T$. We use this view in the rest of the dissertation.
3.2 CIRCUIT MODEL
We represent a circuit as a Mealy machine specifying the updates to the state
elements (latches) in each clock cycle. Our formalization of circuits is typical in
traditional hardware verification, but we make combinational nodes explicit to
facilitate the correspondence with CCDFGs. A circuit is a tuple
$M = \langle I, N, F \rangle$ where $I$ is a vector of inputs; $N$ is a pair
$\langle N_c, N_d \rangle$ where $N_c$ is a set of combinational nodes and $N_d$ is
a set of latches; and $F$ is a pair $\langle F_c, F_d \rangle$ where $F_c$ maps each
combinational node $c \in N_c$ to an expression over $N_c \cup N_d \cup I$, and
$F_d$ maps each latch $d \in N_d$ to a node $n \in N_c \cup N_d \cup I$. $F_d$ is a
delay function: it takes the current value of $n$ to be the next-state value of $d$.

A circuit state is an assignment to the latches in $N_d$. Given a sequence of
valuations to the inputs $i_0, i_1, \ldots$, a circuit trace of $M$ is the sequence
of states $s_0, s_1, \ldots$, where (1) $s_0$ is the initial state and (2) for each
$j > 0$, the state $s_j$ is obtained by updating the elements in $N_d$ given the
state valuation $s_{j-1}$ and input valuation $i_{j-1}$. The observable behavior of
the circuit is the sequence of valuations of the outputs, which are a subset of
latches and combinational nodes.

Figure 3.2: Operation mapping between CCDFG and circuit

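
The circuit model and its traces can be sketched in Python on concrete values. The dictionary layout, the function name `simulate`, and the accumulator example are assumptions made for illustration; combinational nodes are assumed to be listed in an order in which every node appears after the nodes it reads.

```python
def simulate(circuit, inputs, initial_state):
    """Return the trace s0, s1, ... of latch valuations for an input sequence.

    circuit["comb"]  plays the role of F_c: node name -> function of the
                     current environment (latches, inputs, earlier nodes).
    circuit["latch"] plays the role of F_d: latch name -> name of the node
                     whose current value becomes the latch's next value.
    """
    state = dict(initial_state)
    trace = [dict(state)]
    for inp in inputs:
        env = {**state, **inp}
        for node, fn in circuit["comb"].items():
            env[node] = fn(env)
        state = {latch: env[src] for latch, src in circuit["latch"].items()}
        trace.append(dict(state))
    return trace


# Hypothetical 4-bit accumulator: one combinational adder and one latch.
accumulator = {
    "comb": {"sum": lambda env: (env["acc"] + env["in"]) % 16},
    "latch": {"acc": "sum"},
}
trace = simulate(accumulator, [{"in": 1}, {"in": 2}, {"in": 3}], {"acc": 0})
print([s["acc"] for s in trace])
```

Each state in the trace depends only on the previous state and the previous input, as in condition (2) of the definition.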
3.3 CORRESPONDENCE BETWEEN CCDFGS AND CIRCUITS
Given a CCDFG $G$ and a synthesized circuit $M$, it is tempting to define a notion
of correspondence as follows: (1) establish a fixed mapping between the state
variables of $G$ and the latches in $M$, and (2) stipulate an execution of $G$ to be
equivalent to an execution of $M$ if they have the same observable behavior.
However, this does not work in general since the mappings between state variables
and latches may be different in each clock cycle. To address this, we introduce
$EMap : ops \to N_c$, mapping CCDFG operations to the combinational nodes in the
circuit: each operation is mapped to the combinational node that implements the
operation; the mapping is independent of clock cycles. Figure 3.2 shows the
mapping for the synthesized circuit of TEA. Recall from Section 2.1 that the FSM
decides the control signals for the circuit; the FSM is thus excluded from the
mapping.
We now define the equivalence between $G$ and $M$. A CCDFG state $x$ of $G$ is
equivalent to a circuit state $s$ of $M$ with respect to an input $i$ and a
microstep partition $t$, if for each operation $op$ in $t$, the inputs to $op$
according to $x$ and $i$ are equivalent to the inputs to $EMap(op)$ according to
$s$ and $EMap(i)$, i.e., the values of each input to $op$ and the corresponding
input to $EMap(op)$ are equivalent, and the outputs of $op$ are equivalent to the
outputs of $EMap(op)$.

Given a CCDFG $G$ and a circuit $M$, $G$ is equivalent to $M$ if and only if for
any execution $[x_0, x_1, x_2, \ldots]$ of $G$ generated by an input sequence
$[i_0, i_1, i_2, \ldots]$ and by microstep partition $[t_0, t_1, \ldots]$ of $G$,
and the state sequence $[s_0, s_1, s_2, \ldots]$ of $M$ generated by the input
sequence $[EMap(i_0), EMap(i_1), EMap(i_2), \ldots]$, $x_k$ and $s_k$ are
equivalent with respect to $t_k$ under $i_k$, for all $k \ge 0$.
3.4 DUAL-RAIL SIMULATION FOR EQUIVALENCE CHECKING
We check equivalence between CCDFG $G$ and circuit $M$ by dual-rail symbolic
simulation (Figure 3.3); the two rails simulate $G$ and $M$ respectively, and are
synchronized by clock cycle. The equivalence checking in clock cycle $k$ is
conducted as follows:
1. The current CCDFG state $x_k$ and circuit state $s_k$ are checked to see
whether, for the input $i_k$, the inputs to each operation $op$ in the scheduling
step $t_k$ are equivalent to the inputs to $EMap(op)$. If yes, continue;
otherwise, report inequivalence.
2. $G$ is simulated by executing $t_k$ on $x_k$ under $i_k$ to compute $x_{k+1}$
and recording the outputs of each $op \in t_k$. $M$ is simulated for one clock
cycle from $s_k$ under input $EMap(i_k)$ to compute $s_{k+1}$. The outputs for each
$op$ are checked for equivalence with the outputs of $EMap(op)$. If yes, continue;
otherwise, report inequivalence.
3. The next scheduling step $t_{k+1}$ is determined from control flow. If $t_k$
has multiple outgoing control edges, the last microstep of $t_k$ executed is
identified. The outgoing control edge from this microstep whose condition
evaluates to true leads to $t_{k+1}$.

Figure 3.3: Dual-rail simulation scheme for equivalence checking between CCDFG and
circuit.

We permit both bounded and unbounded (fixed-point) simulations. In particular,
the simulation proceeds until (i) the equivalence check fails, (ii) the end of a
bounded input sequence is reached, or (iii) a fixed point is reached for an
unbounded input sequence.
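
The per-cycle procedure can be sketched as a bounded loop in Python. The sketch simplifies the scheme in ways that should be read as assumptions: it runs on concrete values where the checker runs on symbolic ones, the choice of the next scheduling step is folded into the step functions, and all names (`dual_rail_check`, `spec_step`, `circuit_step`, `emap`) are hypothetical.

```python
def dual_rail_check(spec_step, circuit_step, emap, x0, s0, inputs):
    """Bounded dual-rail comparison; returns (True, None) or (False, (cycle, op)).

    Each step function maps (state, input) to (next_state, record), where the
    record maps an operation or node name to (inputs, output) for that cycle.
    """
    x, s = x0, s0
    for k, i in enumerate(inputs):
        x, spec_record = spec_step(x, i)         # rail 1: one scheduling step
        s, circ_record = circuit_step(s, i)      # rail 2: one clock cycle
        for op, (op_inputs, op_output) in spec_record.items():
            node_inputs, node_output = circ_record[emap[op]]
            # Step 1 compares operation inputs, step 2 compares outputs.
            if op_inputs != node_inputs or op_output != node_output:
                return False, (k, op)
    return True, None


# Hypothetical 8-bit accumulator used as both specification and circuit.
def spec_step(x, i):
    out = (x + i) % 256
    return out, {"add": ((x, i), out)}


def circuit_step(s, i):
    out = (s + i) & 0xFF
    return out, {"n_add": ((s, i), out)}


def buggy_circuit_step(s, i):
    out = (s - i) & 0xFF                         # wrong operator on purpose
    return out, {"n_add": ((s, i), out)}


emap = {"add": "n_add"}
print(dual_rail_check(spec_step, circuit_step, emap, 0, 0, [1, 2, 3]))
print(dual_rail_check(spec_step, buggy_circuit_step, emap, 0, 0, [1, 2, 3]))
```

The loop stops at the first mismatch or at the end of the bounded input sequence, which corresponds to termination conditions (i) and (ii).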
The bit-level and word-level checkers are complementary. The bit-level checker
ensures that the equivalence checking is decidable, while the word-level checker
provides the optimizations which are crucial to scalability. The word-level checker
can make effective use of results from bit-level checking in many cases. One
typical scenario is as follows. Suppose $M$ is a design module of modest complexity
but is awkward to check at word-level. Then the bit-level checker is used to check
the equivalence of the CCDFG of $M$ with its circuit implementation; when the
word-level checker is used for equivalence checking of a module that calls $M$, it
skips the check of $M$, treating the CCDFG of $M$ and its circuit implementation
as equivalent black boxes.
3.5 TOOL IMPLEMENTATION
We first implemented the dual-rail simulation at the bit level in the Intel Forte
environment [62], where symbolic states are represented using BDDs. However,
experimental results clearly show that bit-level checking does not scale (cf.
Section 3.6). Therefore, we re-implemented our equivalence checker at the word
level in OCaml [58]. This is viable since word-level mappings between operations
and circuit nodes are explicit. We use bit-vectors to encode the variables in the
CCDFG and the circuit; the SMT engine checks input/output equivalence and
determines control paths. Our word-level checker employs the CVC3 SMT engine [3].
Figure 3.4 shows the framework of our equivalence checker. Behavioral synthesis
generates RTL circuits in Hardware Description Languages (HDLs), such as Verilog
or VHDL. Currently, we have developed a parser only for Verilog, but the HDL parser
can easily be extended to support VHDL. Our HDL parser parses HDL files into an
intermediate representation for symbolic simulation. Our HDL parser and simulator
support a synthesizable subset of HDL, i.e., a subset that can be synthesized to
gate level. The CCDFG parser parses CCDFG files generated by our certified
compiler. The RTL and CCDFG symbolic simulators simulate circuits and CCDFGs
simultaneously, synchronized by clock cycle following our dual-rail simulation
scheme. The state checker checks the equivalence of the outputs of the symbolic
simulators by utilizing SMT solvers.
Our checker provides three optimizations targeting different circuit features
(see Chapter 4). Users can specify which optimizations are involved in a particular
check. The dual-rail simulator will be automatically configured according to the
user's specification.

Figure 3.4: Framework of equivalence checker

Table 3.1: Bit-level equivalence checking statistics

    Bit Width   # of Circuit Nodes   Time (Sec.)   BDD Nodes
        2               96               0.02           503
        3              164               0.05          4772
        4              246               0.11         42831
        5              342               0.59         16244
        6              452              12.50         39968
        7              576             369.31        220891
        8              714            6850.56       1197604

3.6 EXPERIMENTAL RESULTS
To establish a baseline, we use the bit-level checker on a set of CCDFGs for GCD
and the corresponding circuits synthesized by AutoESL. The experiments were
conducted on a workstation with a 3GHz Intel Xeon processor and 2GB of memory.
The checking time bound is set to 4 hours.
Table 3.1 shows the results of bit-level SEC for GCD. GCD contains a loop
whose number of iterations depends on the inputs. Since all operations are
decomposed to the bit level, the running time grows exponentially with the bit
width. For 8-bit GCD, SEC takes about 2 hours. Pure bit-level SEC is thus not
feasible for more complex designs.
To experiment with our word-level checking scheme, we have checked several
designs which have different design features. The statistics are shown in Table 3.2.
"-" signifies "out of time or memory". DCT (Discrete Cosine Transform) is a widely
used algorithm in the image processing domain, which contains sequential
computation without loops. SEC for DCT only takes half a minute. Unfortunately, we
cannot finish the checking for GCD, TEA, and 3DES. These designs either require a
very expensive fixed-point computation or have complex modular hierarchies. Next,
we present how to further optimize our checking scheme in Chapter 4.

Table 3.2: Word-level equivalence checking statistics

    Design                      GCD    TEA    DCT    3DES
    C Code Size (# of Lines)     14     12     52     325
    RTL Size (# of Lines)       364   1001    688   18053
    Time (Seconds)                -      -   30.1       -
    Memory (Megabytes)            -      -   49.2       -

Chapter 4
OPTIMIZATIONS
4.1 MOTIVATION AND OVERVIEW
In Chapter 3, we proposed a framework for certifying behaviorally synthesized
RTL through SEC with our CCDFG representation. However, we found that
naive word-level checking ran into scalability issues. In this chapter, we present
a suite of optimizations for the SEC step above, which exploit both the explicit
control and data flow representations in the CCDFG and the module structures
in the ESL description. We have applied these optimizations in the verification of
RTL synthesized by AutoESL. Our experiments show that they scale SEC to tens
of thousands of lines of synthesized RTL from complex behavioral specifications
(e.g., unbounded loops, modules, etc.), making it viable for industrial designs. We
know of no other SEC framework that can handle behaviorally synthesized RTL
of such complexity.
4.2 CUT-POINTS
The cutpoint optimization involves pre-verifying, off-line, the equivalence of
specific CCDFG operations and their circuit implementations. Subsequently, during
SEC, these operations are replaced in the CCDFG and RTL by equivalent symbols. Note
that only the equivalences (not computations) are relevant to SEC; if the inputs
to a cutpoint are equivalent, their outputs can be replaced by equivalent symbols,
causing only equivalences (not outputs themselves) to be propagated.
We utilize two types of cutpoints, combinational and sequential. Combinational
cutpoints are applicable to combinational portions, and have been studied
extensively [45]. RTL designs with complex combinational circuits are generated due
to transformations such as loop unrolling: in the TEA example, the behavioral
synthesis tool can fully unroll the for loop, creating complex combinational
circuits by aggregating operations from different iterations. Sequential cutpoints
cut sequential circuits and keep complex expressions from propagating across clock
cycles.
In the TEA example (Figure 3.2), the scheduling step starting with the
conditional pl_start == 1 and ending with the assignment pl_start = 1 is
implemented as a combinational block that can be cut at all operations, e.g., the
one computing tmp54; the equivalence of this operation with the corresponding RTL
is certified separately (e.g., by theorem proving). On the other hand, the
operation that computes tmp49 can be used as a sequential cutpoint since it
connects two scheduling steps.
To explain the role of post-scheduling CCDFGs in cutpoint optimization, note
that the ESL specification is unclocked while the RTL is clocked. Furthermore,
after application of high-level transformations, the RTL has little correspondence
in internal operations with the behavioral description, making it difficult to
identify cutpoints. However, this problem is eliminated in our framework since
there is a readily available correspondence with the post-scheduling CCDFG, e.g.,
the operation-to-resource mapping, which provides natural candidates for cutpoints.
4.3 CUT-LOOP OPTIMIZATION
A major challenge in SEC is termination, which typically requires expensive
fixed-point computation. Termination becomes a problem when the input description
contains unbounded loops. Consider the CCDFG of the Greatest Common Divisor
(GCD) algorithm shown in Figure 4.1. The bit-level symbolic simulation for GCD,
even for 8-bit integers, takes more than 6850 seconds and 1197606 BDD nodes
(cf. Section 3.6). A naive fixed-point computation at word-level is also
expensive. Even for designs with deep bounded loops (e.g., TEA), full unrolling is
too expensive for both bit-level and word-level simulations.

    int gcd(int a, int b)
    {
        int t;
        do {
            if (a >= b) a = a - b;
            else { t = a; a = b; b = t; }
        } while (b != 0);
        return a;
    }

Figure 4.1: C source code and CCDFG for GCD

Our solution is the cut-loop optimization, which "cuts" the loop, reducing the
fixed-point computation to three checks, i.e., at the entry, body, and exit. The
idea is inspired by theorem proving approaches to verifying software loops. At
entry, we check equivalence between the CCDFG and the RTL for the path to the
initial loop entry. For the body, we check that if (1) equivalence is maintained at
the loop join point, and (2) the loop does not exit, then equivalence is maintained
after one iteration. For the exit, we check that if (1) equivalence is maintained
at the loop join point, and (2) the loop exits, then equivalence is maintained
at the loop exit. The loop structure and entry point information are available
from the synthesis tool. The checks above are inspired by inductive assertions in
software verification [22, 31]: the three checks are essentially the proof
obligations discharged by a verification condition generator, if we think of
equivalence with RTL as the invariant maintained by the loop. Using ACL2, we proved
that the checks guarantee word-level equivalence over the entire loop execution.
The proof follows a reasoning analogous to that used in justifying the use of loop
invariants to cut loops for program verification using inductive assertions.

Figure 4.2: Cut-loop optimization for GCD example (panels: Loop Entry, Loop Body,
Loop Exit)

We illustrate cut-loop optimization on the GCD example in Figure 4.2. At
the loop entry, the check that a and b are equivalent to their RTL counterparts
is trivially true since they are inputs. For the body check, the condition b ≠ 0 is
applied to ensure the iteration does not exit, and for the exit check, the condition
b = 0 is applied to ensure the loop exits. For both body and exit checks, the
condition being checked is that if a and b are equivalent before executing a − b,
then they are equivalent after one iteration. With this optimization, word-level
SEC on GCD finishes within two seconds. The cut-loop optimization is also useful
[Figure 4.3: Modular SEC for 3DES. The figure pairs the CCDFGs of three_des_crypt and des_crypt with their circuits. In the CCDFG of three_des_crypt, tmp12 = data_in | tmp13 is followed by data = call des_crypt(tmp12, key); in the CCDFG of des_crypt, tmp32 = data_in ^ tmp56 is followed by tmp15 = call IP(key, tmp32). On the circuit side, each module has an FSM with start, reset, and done signals, a data_control block, and key, data_in, and data_out ports.]
for deep bounded loops, e.g., we achieved major speed-up for word-level SEC on
TEA (cf. Section 4.5).
Note that loop detection is greatly simplified since CCDFGs are derived from
ESL designs by applying primitive transformations.
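The three cut-loop checks can be illustrated with a small C sketch. This is not the symbolic, word-level check described above: it assumes 8-bit operands so that the loop state can be enumerated exhaustively, and the names State, spec_body, impl_body, impl_body_buggy, check_entry, check_body, and check_exit are hypothetical. The "implementation" body is a hand-written stand-in for the RTL side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Loop state at the join point: the two live variables of the GCD loop. */
typedef struct { uint8_t a, b; } State;
typedef State (*Body)(State);

/* Specification side: one iteration of the loop body of the GCD example. */
State spec_body(State s) {
    if (s.a >= s.b) {
        s.a = (uint8_t)(s.a - s.b);
    } else {
        uint8_t t = s.a; s.a = s.b; s.b = t;
    }
    return s;
}

/* Hypothetical implementation side: same behavior, mux-style structure. */
State impl_body(State s) {
    bool ge = s.a >= s.b;
    State n = { ge ? (uint8_t)(s.a - s.b) : s.b, ge ? s.b : s.a };
    return n;
}

/* Deliberately wrong variant (uses > instead of >=) to show a failing check. */
State impl_body_buggy(State s) {
    bool gt = s.a > s.b;
    State n = { gt ? (uint8_t)(s.a - s.b) : s.b, gt ? s.b : s.a };
    return n;
}

static bool same(State x, State y) { return x.a == y.a && x.b == y.b; }

/* Entry check: both sides load the same inputs A and B into (a, b). */
bool check_entry(void) {
    for (unsigned A = 0; A < 256; ++A)
        for (unsigned B = 0; B < 256; ++B) {
            State spec = { (uint8_t)A, (uint8_t)B };
            State impl = { (uint8_t)A, (uint8_t)B };
            if (!same(spec, impl)) return false;
        }
    return true;
}

/* Body check: equal states at the join point and b != 0 imply equal
   states after one iteration. */
bool check_body(Body impl) {
    assert(impl != 0);
    for (unsigned a = 0; a < 256; ++a)
        for (unsigned b = 1; b < 256; ++b) {   /* b != 0: the loop does not exit */
            State s = { (uint8_t)a, (uint8_t)b };
            if (!same(spec_body(s), impl(s))) return false;
        }
    return true;
}

/* Exit check: equal states at the join point and b == 0 imply equal results. */
bool check_exit(void) {
    for (unsigned a = 0; a < 256; ++a) {
        State s = { (uint8_t)a, 0 };
        uint8_t spec_result = s.a, impl_result = s.a;   /* both sides return a */
        if (spec_result != impl_result) return false;
    }
    return true;
}
```

Each check examines a single acyclic fragment, so no fixed-point computation over the loop is needed; the buggy variant is rejected by the body check alone.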
4.4 MODULAR ANALYSIS
Synthesized RTL is often large and complex, e.g., for the 3DES design, the behavioral
synthesis tool generates 18053 lines of Verilog. Behavioral synthesis reduces
RTL size via modular reuse: without modules, the RTL for 3DES would be 128K
lines. Modules may be present in the input description or introduced by behavioral
synthesis. To support modules, CCDFGs are extended with function calls. An
example function invocation in the 3DES CCDFG is shown in Figure 4.3.
With modules, a given behavioral description corresponds to several CCDFGs
(each corresponding to a module). A module can be either combinational or se-
quential. A combinational module returns in the same clock cycle in which it is
invoked, while a sequential module takes several cycles. Note that the top-level
CCDFG may not capture all the scheduling steps since some are in other sequen-
tial modules. In the synthesized RTL, there is a module for each CCDFG. In
addition to RTL code implementing functionality, there is additional code for in-
terfaces, e.g., a module commonly needs reset, start, and allow signals besides
input/output data signals.
One naive approach to handle modules is to unfold them, causing each module
to be analyzed at each invocation. We prefer compositional analysis of each module
separately. Our scheme works as follows.
• For each module M, the CCDFG and RTL for M are checked for equivalence
  separately.
• When verifying a module M′ that invokes M, the invocations of M in the
  CCDFG and RTL of M′ are replaced by equivalent uninterpreted functions.
The equivalence between function invocation in the CCDFG and the module interfacing
mechanism in the RTL is pre-certified. Modular analysis is possible because of explicit
correspondence between the CCDFG and the RTL of a module: since we use the
same module structure used in the synthesis, the decomposition does not introduce
over-approximations.
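The role of the uninterpreted function can be sketched in C as follows. This is only an illustration under stated assumptions: the callers, the stand-in callees (callee_xor, callee_mix), and the sampling loop are all hypothetical, and sampling inputs is a test, not the symbolic check performed by SEC. The point is that both sides invoke the same abstract callee, so the callers agree whenever they pass it equal arguments, regardless of what the callee computes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for an uninterpreted function: any pure function of two words. */
typedef uint32_t (*Callee)(uint32_t, uint32_t);

/* Hypothetical specification-side caller: prepares an argument, then
   invokes the callee. */
uint32_t spec_caller(uint32_t data_in, uint32_t key, Callee f) {
    uint32_t tmp = data_in | 0xFFu;
    return f(tmp, key);
}

/* Hypothetical implementation-side caller: same argument, computed
   differently (clear the low byte, then add 0xFF). */
uint32_t impl_caller(uint32_t data_in, uint32_t key, Callee f) {
    uint32_t tmp = (data_in & ~UINT32_C(0xFF)) + 0xFFu;
    return f(tmp, key);
}

/* Two arbitrary interpretations used to exercise the abstraction. */
uint32_t callee_xor(uint32_t x, uint32_t k) { return x ^ k; }
uint32_t callee_mix(uint32_t x, uint32_t k) { return (uint32_t)(x * 31u + k); }

/* Compares the two callers on sampled inputs for a given callee. */
bool callers_agree(Callee f) {
    assert(f != 0);
    for (uint32_t i = 0; i < 1000u; ++i) {
        uint32_t data_in = (uint32_t)(i * UINT32_C(2654435761));
        uint32_t key = i ^ UINT32_C(0xABCD);
        if (spec_caller(data_in, key, f) != impl_caller(data_in, key, f))
            return false;
    }
    return true;
}
```

Because the callee is never inlined, the cost of checking the caller does not depend on the size of the callee, which is checked once on its own.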
Currently, we do not handle recursive modules since recursions in ESL descrip-
tions are typically removed by compiler transformations; however, modular analysis
can be extended to recursion by replacing the callee with a "module summary",
analogous to procedure summaries in software verification [1].
Table 4.1: Designs, features, and optimizations

Designs    Features                        Optimizations
GCD        Unbounded Loop                  Cut-Loop
DCT        Sequential without Loop         Cutpoint
TEA        Bounded Loop                    Cut-Loop
           Unrolled Loop                   Cutpoint
DES        Bounded Loop                    Cut-Loop
           Unrolled Loop                   Cutpoint
           High Sequential Complexity
3DES       Bounded Loop                    Cut-Loop
           Unrolled Loop                   Cutpoint
           High Sequential Complexity      Modular Analysis
3DES key   Bounded Loop                    Cut-Loop
           Unrolled Loop                   Cutpoint
           High Sequential Complexity      Modular Analysis
           High Combinational Complexity
4.5 EXPERIMENTAL RESULTS
Table 4.1 lists the designs used to evaluate our optimizations. Each design is
synthesized by a behavioral synthesis tool. The designs are selected carefully to
exercise different facets of our framework. They include encryption algorithms, e.g.,
TEA, DES, 3DES, and 3DES key (3DES with key generation). DES, 3DES, and 3DES key
contain bounded loops and benefit from cut-loop; their sequential and combinational
complexities also illustrate the role of cutpoints. 3DES and 3DES key have
modular structures, and modular analysis is vital to discharge their SEC. DES was
deliberately synthesized without modules to further investigate the role of modular
analysis. All experiments were conducted on a workstation with a 3GHz Intel Xeon
processor and 2GB of memory.

Table 4.2: Word-level equivalence checking statistics

Design     RTL Size (# Lines)   Optimizations   Time (Secs)   Memory (MB)
GCD        364                  NO              -             -
                                CP              -             -
                                CP + CL         2             4.1
DCT        688                  NO              71            92.16
                                CP              30.1          49.2
TEA        1001                 NO              -             -
                                CP              116           141.3
                                CP + CL         15.6          24.6
DES        11520                NO              -             -
                                CP              5896          614.4
                                CP + CL         1482          426.4
3DES       18053                NO              -             -
                                CP + MA         872.5         114.7
                                CP + MA + CL    355.7         59.4
3DES key   79976                NO              -             -
                                CP + MA         2868.5        307.2
                                CP + MA + CL    2351.7        307.2

Table 4.2 shows the results of word-level SEC for all the designs from Table 4.1.
Here, "-" signifies "out of time or memory", "CP" stands for cutpoints, "CL" for cut-loop,
and "MA" for modular analysis. The "NO" entries represent "no optimizations":
it is clear that without the optimizations, SEC cannot handle long computation
sequences or loops. Since DCT contains only sequential computations and no
modules, cut-loop and modular analysis are not applicable; however, cutpoint
optimization reduces the symbolic simulation cost to about half, in both time
and memory usage. Cutpoints, together with modular analysis, can handle long
computation sequences and bounded loops (e.g., TEA, 3DES, and 3DES key), but
blow up on fixed-point computation for unbounded loops (e.g., GCD), underlining
the need for cut-loop. The cut-loop optimization handles unbounded loops, while
also reducing the time and memory usage for designs with bounded loops. The
savings from cut-loop are relatively small for 3DES key since the design contains
large combinational computations (for generating the key) which overshadow loop
unrolling cost. The results on DES highlight the importance of modular analysis
when possible: although the RTL is smaller than 3DES and 3DES key, the time and
memory usage is higher due to lack of modules (and hence, modular analysis); for
3DES and 3DES key, even the behavioral synthesis tool fails without modules. The
results indicate that word-level SEC with our optimizations can scale to realistic
designs. Note that each of DES, 3DES, and 3DES key is over 10,000 lines of
RTL, and 3DES key (even with modules) involves about 80,000 lines. We know of
no other framework that can apply SEC on behaviorally synthesized RTL at this
scale.
Chapter 5
SEC FOR SYNTHESIZED LOOP PIPELINES
5.1 MOTIVATION AND OVERVIEW
Loop pipelining is a critical transformation in behavioral synthesis to reduce the
latency of designs with loop structure by producing temporal overlap of successive
loop iterations. It is available in most state-of-the-art tools (e.g., AutoESL). However,
it induces retiming and out-of-order executions; furthermore, the mapping
of internal operations is lost between the sequential description and the pipelined
RTL. This rules out standard SEC techniques for their comparison. In particular,
some key optimizations (e.g., cut-loop) become inapplicable.
In this chapter, we discuss the challenges with loop pipelines and present an
equivalence checking approach for certifying synthesized hardware designs in the
presence of loop pipelining transformations. We have applied our approach on
industrial-size designs with thousands of lines of RTL, synthesized by AutoESL.
This scalability is derived from tight integration with the synthesis flow. Instead
of directly comparing the synthesized RTL with the sequential description, we develop
an intermediate pipeline reference model. This model provably preserves the
semantics of the sequential description. However, our model generation algorithm
is parameterized by pipeline parameters, whose values are obtained from the
synthesis tool; this ensures that the structure of the generated model is similar to
that of the synthesized RTL, and enables internal operation mapping between the
reference model and the RTL.
    #define N 100
    int pipe(int a[N]) {
        int i;
        int result = 0;
        for (i = 0; i < N; ++i) {
            int tmp1 = result + a[i];
            a[i] = tmp1;
            result = tmp1 + 1;
        }
        return result;
    }

(a) C code with loop

    Execution order before pipelining:
        S2 S3 S4 S2 S3 S4 S2 S3 S4
    Execution order after pipelining:
        S2 S3 S4
           S2 S3 S4
              S2 S3 S4
                 S2 S3 S4

(b) Execution before and after pipelining

Figure 5.1: Example of loop pipeline
5.2 CHALLENGES WITH LOOP PIPELINES
Loop pipelining allows multiple successive iterations of a loop to operate in parallel
by executing a new iteration before the previous iteration completes. Consider
pipelining the loop in Figure 5.1 (a). Figure 5.1 (b) shows the execution orders of
the scheduling steps in the loop body before and after pipelining. In the sequential
design, execution of iteration i involves reading the value of a[i] from the memory in
S2, adding result and a[i] in S3, and storing the new value to the memory and computing
result in S4. However, with pipelining, iteration i+1 is initiated before iteration
i completes.
The result of overlapping executions is a significant difference in the schedule
of operations between the CCDFG of the sequential design and the RTL generated
from the pipeline. Each scheduling step of the pipeline is composed of a number
of scheduling steps of the sequential design; there is no longer a direct operation
mapping between the CCDFG and RTL. Furthermore, due to the difference in the
execution order of the scheduling steps, the controlling finite-state machines are
also different. A direct SEC between the two reduces to comparison of their
input-output relations, which is prohibitively expensive for loops with many iterations.
5.3 SEC WITH REFERENCE MODEL
Our solution to the above problem is to develop a reference pipelining transformation
on CCDFGs [29]. Given a CCDFG G and certain pipeline parameters (see
below), we generate a new CCDFG G′ by pipelining the loops. Note that our
transformation is different from that used by the synthesis tool to generate the
pipelined RTL. The synthesis tool transformation includes algorithms and heuristics
to determine how many iterations to pipeline, etc.; on the other hand, our
algorithm merely takes such information as parameters to create G′. In fact, we
obtain this information from the synthesis tool itself. Thus the output CCDFG G′,
if successfully generated by our algorithm,¹ is guaranteed to have close structural
correspondence with the synthesized RTL. On the other hand, irrespective of the
actual value of these parameters, G′ is guaranteed to be semantically equivalent
to G and can therefore be soundly used instead of G for SEC.
The following definition characterizes the loops handled by the algorithm.
Definition 4 (Pipelinable Loop). For a CCDFG $G \triangleq \langle G_{CD}, M, T \rangle$ and for $T_0 \subseteq T$,
we say that $T_0$ induces a "pipelinable loop" if (1) the projection of $G$ on $T_0$ is
a cycle $C$, and (2) in the projection of $G$ on $T$ there is a unique node (called the
"entry node") in $C$ with a predecessor outside $C$ and a unique node (called the
"exit node") in $C$ with a successor outside $C$.
¹ Our algorithm does not use semantic invariants of the program being transformed. Thus we
may fail to pipeline a loop for a given number of iterations (and report a spurious hazard) when
in fact such a pipeline is hazard-free. However, in practice we have not seen a case where the
synthesis tool generates a pipeline with specific parameters and our algorithm reports a spurious
hazard on those parameters.
Remark. The notion of pipelinable loops is more restrictive than the common loop
definition in programming languages. In particular, a pipelinable loop has a single
exit and loop nesting is disallowed. Our definition is based on the kind of loops
that are actually pipelined by behavioral synthesis tools. For instance, if a design
contains nested loops, then the inner loop can be unrolled completely (possibly by
compiler transformations) before the outer loop can be pipelined.
Algorithm 1 PIPELINELOOP(L = ⟨S, E, M⟩, I, N)
1: S′_1 ← GenerateSchedulingSteps(S, I, N)
2: ⟨S′_2, M′_1⟩ ← GeneratePipelineRegs(S′_1, M, E, I)
3: E′_1 ← GenerateEdges(S′_2, E, I, N)
4: ⟨S′_3, M′_2⟩ ← GenerateForwarding(S′_2, M′_1, E′_1, I)
5: return ⟨S′_3, E′_1, M′_2⟩
Given CCDFG G, our reference transformation replaces each loop L in G with
the pipelined refinement of L as described in Algorithm 1. Here I is the iteration
interval, which indicates how many clock cycles later a new iteration is to be "fed"
into the pipeline, and N is the number of scheduling steps in L. Values of these
parameters are readily available from behavioral synthesis. Figure 5.2 illustrates
the use of the algorithm on our simple example. We now discuss the different steps
of the algorithm in greater detail.
Algorithm 2 describes the construction of scheduling steps of the pipelined
CCDFG. The algorithm simulates the process of "feeding" a new loop iteration into
the pipeline until the pipeline is full. Consider the sequence of iterations shown in
Figure 5.3. The output is an array (initially empty) of graphs. Each graph represents
the projection of the reference pipeline CCDFG at a single scheduling step.
We first build the nodes of each graph in the array (Lines 3-6); we then compute
the edges within each graph (Lines 8-14). The set of nodes of each graph in SG
is determined by I and N. The algorithm updates SG for every iteration. If the
[Figure 5.2: Input and output CCDFGs of loop pipelining transformation. (a) CCDFG before pipelining, with scheduling steps S2, S3, S4 followed by S5; (b) pipelined CCDFG construction, showing the pipeline prologue, pipeline full, and pipeline epilogue stages; (c) CCDFG after pipelining, with scheduling steps S′2 to S′6 followed by S5.]
pipeline is not yet full, i.e., can accept a new iteration but no iteration is completed
yet (Line 3), then a new iteration is introduced and merged with the existing iterations
in the pipeline by subroutine mergeIteration. Subroutine mergeIteration
merges each scheduling step in the new iteration with the corresponding steps already
in the pipeline, and returns new scheduling steps as shown in Figure 5.3 (b), (c),
(d). To model the exit, the pipeline enters the "flushing" stage in which iterations
are completed without new iterations being introduced. The pipeline full stage
corresponds to the new loop body for the pipelined CCDFG while the prologue and
epilogue correspond to the entry and exit.
We now build the edges for each graph in SG. The goal is to ensure that the
new control flow respects that of the input loop. The process is demonstrated in
Figure 5.4 (a). Recall that a scheduling step of the pipeline involves a number of
scheduling steps of the original CCDFG (across several iterations). To ensure that
the original control flow is respected, a scheduling step s′ of the pipeline is executed
following the iteration order. This is achieved by adding edges enforcing the evaluation
of microsteps from left to right. For instance, in S′4 shown in Figure 5.4 (a),
an edge is created to connect S4 and S3. Since S4 is from an earlier iteration, the
direction is from S4 to S3. The edge condition !exitcond states that the loop does not
exit. If the loop exits at iteration i, all iterations from (i+1) must be skipped: this
is ensured by inserting the exit condition on all such edges. Subroutine buildEdge
creates the correct edge condition according to the control flow.
Algorithm 2 GenerateSchedulingSteps(S, I, N)
1: SG ← ∅
2: iter ← 0 /* loop iteration */
3: while iter × I < N do
4:   SG ← mergeIteration(SG, S, I, iter)
5:   iter ← iter + 1
6: end while
7: /* build new edges within one single scheduling step */
8: for each step s′ in SG do
9:   for each step pair (s′[pos], s′[pos + 1]) in s′ do
10:    e′ ← buildEdge(s′[pos], s′[pos + 1])
11:    s′ ← append(s′, e′)
12:  end for
13: end for
14: return SG
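The node-construction part of Algorithm 2 (Lines 3-6) can be mirrored by index arithmetic. The following C sketch is an illustration only, assuming 0-based indices (original step 0 stands for S2, pipelined step 0 for S′2) and assuming that iteration iter enters the pipeline iter × I steps after iteration 0; the function names are hypothetical and graphs are reduced to step indices.

```c
#include <assert.h>

/* Number of iterations merged while "iter * I < N" holds (Line 3). */
int merged_iterations(int n_steps, int interval) {
    assert(n_steps > 0 && interval > 0);
    return (n_steps + interval - 1) / interval;   /* ceil(N / I) */
}

/* Number of pipelined scheduling steps produced by merging those iterations:
   the last merged iteration starts (iters - 1) * I steps late and needs N steps. */
int pipelined_step_count(int n_steps, int interval) {
    return (merged_iterations(n_steps, interval) - 1) * interval + n_steps;
}

/* Original step executed by iteration `iter` inside pipelined step k,
   or -1 if that iteration is not active in step k. */
int original_step_at(int n_steps, int interval, int k, int iter) {
    assert(n_steps > 0 && interval > 0 && iter >= 0);
    int offset = k - iter * interval;
    return (offset >= 0 && offset < n_steps) ? offset : -1;
}

/* Number of original steps merged into pipelined step k. */
int occupancy(int n_steps, int interval, int k) {
    int count = 0;
    for (int iter = 0; iter < merged_iterations(n_steps, interval); ++iter) {
        if (original_step_at(n_steps, interval, k, iter) >= 0) ++count;
    }
    return count;
}
```

For the running example (N = 3, I = 1) this yields five pipelined steps, with the middle step holding one original step from each of three iterations, which matches the S′2 to S′6 layout described for Figures 5.2 and 5.3.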
Algorithm 3 inserts "pipeline registers" between iterations to facilitate correct
data flow and prevent variables from being overwritten before being consumed. In
a CCDFG, the effect of pipeline registers is mimicked using temporary variables
as follows. We first compute all program variables that may be overwritten before
being consumed (Line 2); this constitutes the variables that potentially require
[Figure 5.3: Construction of scheduling steps. (a) Scheduling steps S2, S3, S4 before pipelining; (b), (c), (d) scheduling steps S′2 to S′6 after merging iterations 0, 1, and 2, respectively.]
pipeline registers. To find such variables, we compare the distance between the
producer ms_p and the last consumer ms_c; if the distance is greater than I, v is
assigned the new data value of the next iteration before the current iteration's value
has been fully consumed; this warrants insertion of pipeline variables in every
scheduling step between ms_p and ms_c. The value is propagated every clock cycle
following the CCDFG data flow. In Figure 5.5, variable %a_addr is computed in
S2 and the last use scheduling step is S4. The distance is greater than I = 1;
therefore, temporary variables a_addr_pipe1 and a_addr_pipe2 are inserted. Subroutine
addPipelineReg generates new microsteps for assignments of the pipeline
variables, creates new edges to integrate these microsteps into the data path, and
updates the schedule.
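The distance test above can be sketched in C. The sketch assumes 0-based step indices (S2 is step 0, S4 is step 2) and that, within one pipelined step, microsteps of earlier iterations are evaluated before those of later iterations. The register count of one temporary per step of distance is an assumption chosen to mirror the two variables a_addr_pipe1 and a_addr_pipe2 in the example; all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Distance in scheduling steps from the producer to the last consumer. */
int live_distance(int producer_step, int last_consumer_step) {
    assert(producer_step >= 0 && last_consumer_step >= producer_step);
    return last_consumer_step - producer_step;
}

/* Number of later iterations whose producer has already run strictly before
   the current iteration's last consumer: iteration k later produces at
   cycle k * I, and the consumer runs at cycle `distance`. */
int overwrites_before_use(int distance, int interval) {
    assert(distance >= 0 && interval > 0);
    return distance == 0 ? 0 : (distance - 1) / interval;
}

/* The test used in the text: registers are needed when distance > I. */
bool needs_pipeline_regs(int distance, int interval) {
    assert(distance >= 0 && interval > 0);
    return distance > interval;
}

/* Assumption: one temporary per step of distance when registers are needed. */
int pipeline_reg_count(int distance, int interval) {
    return needs_pipeline_regs(distance, interval) ? distance : 0;
}
```

Under these assumptions, needs_pipeline_regs holds exactly when overwrites_before_use is at least one, i.e., when some later iteration would clobber the variable before its last use.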
[Figure 5.4: Construction of edges. (a) Building the new edges within one single scheduling step; (b) building the new edges between scheduling steps and the back edges. Edges among S′2 to S′6 are labeled with the conditions !exitcond, exitcond, or TRUE; edges to Loop Exit and the dead edges are marked.]
Algorithm 3 GeneratePipelineRegs(S, M, E, I)
1: S′ ← S; M′ ← M
2: V_pr ← getPipelineRegisterVars(S, M, E, I)
3: for each variable v in V_pr do
4:   ms_p ← getProducer(v)
5:   ms_c ← getLastConsumer(v)
6:   ⟨S′, M′⟩ ← addPipelineReg(S′, M′, E, ms_p, ms_c)
7: end for
8: return ⟨S′, M′⟩
Algorithm 4 shows the construction of edges governing the control flow of the
pipelined CCDFG. Figure 5.4 (b) shows how to build edges between new scheduling
steps (Lines 3-6). One example is the edge from S2 in S′2 to S4 in S′3. Because the
pipeline is still in the prologue stage, the edge condition is that the loop does not exit.
Algorithm 4 GenerateEdges(S, E, I, N)
1: E′ ← ∅
2: /* build the edges between new scheduling steps */
3: for each step pair (S[i], S[i + 1]) in S do
4:   e′ ← buildEdge(S[i], S[i + 1])
5:   E′ ← append(E′, e′)
6: end for
7: /* build the back edge */
8: s_src ← S[N − 1]; s_dst ← S[N − I]
9: e_backedge ← buildEdge(s_src, s_dst)
10: E′ ← append(E′, e_backedge)
11: /* build the early exit edge */
12: i ← N − 1
13: while i < sizeof(S) − 1 do
14:   e′ ← buildEdge(S[i], s_loopexit)
15:   E′ ← append(E′, e′)
16:   i ← i + I
17: end while
18: return ⟨E′⟩
The back edge of the new loop connects the last scheduling step of the pipeline
full stage to the first one. S′[N − 1] is the last one and S′[N − I] is the first step
in the pipeline full stage. Finally, for an unbounded loop, exit can occur in any
iteration. Thus, we must allow the pipeline to start flushing in any iteration, even
when the pipeline is not full (Lines 12-17). In the example shown in Figure 5.4 (b),
the exit point of the loop is in S2; therefore, in the pipeline epilogue, the edge from
S4 to S3 will never be valid. This is because the loop would have already exited
and the S3 and S4 of the new iteration will not execute. The dead edges will be
removed to simplify the final CCDFG.
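The index structure of Algorithm 4 can be sketched in C as follows. This is an illustration under stated assumptions: scheduling steps are represented only by their 0-based indices, edge conditions (exitcond, !exitcond) and dead-edge removal are omitted, and the names Edge, LOOP_EXIT, and generate_edges are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { LOOP_EXIT = -1 };                     /* marker for the loop exit node */
typedef struct { int src; int dst; } Edge;   /* edge between step indices */

/* Emits the edges of Algorithm 4 for `step_count` pipelined steps, an original
   loop of `n_steps` scheduling steps (N), and iteration interval `interval` (I).
   Returns the number of edges written to `out` (capacity `cap`). */
size_t generate_edges(int step_count, int n_steps, int interval,
                      Edge *out, size_t cap) {
    assert(out != NULL);
    assert(n_steps > 0 && interval > 0 && interval <= n_steps);
    assert(step_count >= n_steps);
    size_t n = 0;

    /* Lines 3-6: edges between consecutive new scheduling steps. */
    for (int i = 0; i + 1 < step_count; ++i) {
        assert(n < cap);
        out[n++] = (Edge){ i, i + 1 };
    }

    /* Lines 8-10: back edge from S[N - 1] to S[N - I]. */
    assert(n < cap);
    out[n++] = (Edge){ n_steps - 1, n_steps - interval };

    /* Lines 12-17: early exit edges, one every I steps starting at S[N - 1]. */
    for (int i = n_steps - 1; i < step_count - 1; i += interval) {
        assert(n < cap);
        out[n++] = (Edge){ i, LOOP_EXIT };
    }
    return n;
}
```

For the running example (five pipelined steps, N = 3, I = 1) the back edge is a self-loop on index 2, i.e., the single pipeline-full step S′4.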
Algorithm 5 GenerateForwarding(S, M, E, I)
1: /* find all loop carried dependencies */
2: D_lc ← getCarriedDependencies(S, M, E)
3: S′ ← S; M′ ← M
4: for each pair (o_w, o_r) in D_lc do
5:   if checkForwarding(o_r, I, S′) then
6:     ⟨S′, M′⟩ ← moveOp(o_w, o_r, S′, E, M′)
7:   else
8:     return ERROR
9:   end if
10: end for
11: return ⟨S′, M′⟩
A critical piece of the puzzle is the computation of data forwarding paths along pipeline
iterations (Algorithm 5). Data forwarding is critical to achieving aggressive pipelining
and eliminating data hazards. The first key observation is that forwarding is only
necessary for loop carried dependencies, which extend back to the previous iteration.
D_lc denotes a list of dependencies and Subroutine getCarriedDependencies
finds all loop carried dependencies. Each dependency is a pair of operations (o_w, o_r),
where o_w is the last write operation in the loop body and o_r is the first read operation.
Subroutine checkForwarding checks if the data forwarding is possible (i.e.,
whether the value is computed before use) for these variables in the scheduling
steps of the pipeline. We then implement forwarding using so-called "φ nodes".
φ nodes are special operators in compiler transformations; they are widely used in
resolving conditional branches in a number of compilers, and are used to postpone
computation of control flow until run time. In particular, a φ node is introduced in
a basic block which has multiple predecessors; the values of variables in a φ node
[Figure 5.5: Pipeline registers and forwarding. The figure shows the LLVM-style operations of scheduling steps S2, S3, and S4 (e.g., %i = phi (0, %indvar), %result = phi (0, %result_1), %exitcond = icmp eq %i 100, %a_load = load %a_addr, %tmp1 = add %a_load %result, %result_1 = add %tmp1 1), the pipeline variables %a_addr_pipe1 and %a_addr_pipe2, and the forwarding of %result to the next iteration.]
for a specific execution are given by the specific block which actually precedes the
node in that execution. To understand its utility for data forwarding, consider
Figure 5.5. In the non-pipelined design, φ operators can occur only in scheduling
step S2. The valid value of variable %result is computed by the φ node in
scheduling step S2. Since we desire to execute scheduling steps S2 and S3 within a
single scheduling step, we move the φ from S2 to S3 and forward the value directly
from the producer to the consumer. In general, to implement pipeline forwarding,
we need to relocate the position of the φ operator for a variable to immediately
before its first consumer, and also update the assignment of the φ node according to the
new control flow. The "move" is implemented in moveOp, which will generate a
new scheduling S′ and a new microstep partition M′.
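One plausible reading of the feasibility test behind checkForwarding can be written as a timing inequality. The C sketch below is an assumption, not the dissertation's implementation: it takes 0-based step indices within one iteration, assumes the next iteration starts I steps later, and assumes that within a pipelined step the earlier iteration's microsteps run first, so a value written and read in the same pipelined step can be forwarded. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* The write of iteration i happens at cycle write_step; the read of
   iteration i + 1 happens at cycle read_step + interval. Forwarding is
   feasible when the value is computed no later than it is needed. */
bool check_forwarding(int write_step, int read_step, int interval) {
    assert(write_step >= 0 && read_step >= 0 && interval > 0);
    return write_step <= read_step + interval;
}

/* Earliest step of the next iteration at which the forwarded value can be
   read; a phi placed earlier than this would have to be moved. */
int earliest_read_step(int write_step, int interval) {
    assert(write_step >= 0 && interval > 0);
    int step = write_step - interval;
    return step > 0 ? step : 0;
}
```

With the example's assumed indices (S2 = 0, S3 = 1, S4 = 2; %result_1 written in S4; I = 1), a read in S2 is too early while a read in S3 is feasible, which is consistent with moving the φ for %result from S2 to S3.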
5.4 EXPERIMENTAL RESULTS
We implemented the loop pipelining algorithm on top of our certification framework
for behavioral synthesis. We ran our tool on a collection of pipelined designs
synthesized by AutoESL.
Table 5.1 shows the results. Our framework successfully handled SEC for synthesized
designs with pipelined loops involving several thousand lines of RTL within
reasonable time and memory bounds. Note that this success on pipelines
depends on the applicability of other optimizations during SEC. The reason is
that, because of the presence of non-trivial loops, SEC without cut-loop optimization
requires an expensive fixed-point computation which runs out of memory and
time. For all designs, brute-force SEC between the unpipelined CCDFG and the
RTL times out. SEC between the pipelined CCDFG and the RTL can mostly
finish. With the optimizations applied, SEC finishes with reduced memory and
time usage. The results thus support our preference to compare the RTL with
a closely resembling pipelined CCDFG that facilitates the optimizations, rather
than develop a specialized SEC algorithm for pipelines.
Table 5.1: Loop pipelining experimental results

                                         Loop Info.                   Pipeline Info.         Without Opt.     With Opt.
Design      RTL      App. Domain         Interval  Depth  Operations  Forwarding  Pipeline   Mem.    Time     Mem.    Time
            # line                                                                Register   (MB)    (Sec)    (MB)    (Sec)
Memory Op   291      Memory operation    1         4      18          2           2          24      38       4       0.3
TEA         383      Cryptography        1         4      28          4           2          -       -        40      6.2
XTEA        483      Cryptography        1         3      37          4           1          -       -        52      7.8
CORDIC      485      Data processing     1         3      31          4           0          38      7.9      5       0.9
SmithWater  517      Data processing     2         3      73          3           0          -       -        134     50.2
FIR         610      Signal processing   3         5      27          3           1          763     127.4    63      10.8
YUVToRGB    756      Image processing    2         6      77          1           5          -       -        335     128.9
MotionComp  1248     Image processing    1         3      53          3           0          434     132.2    50      11.4
DES         3292     Cryptography        1         3      17          2           2          468     364.7    257     163.3
Chapter 6
SEC FOR SYNTHESIZED FUNCTION PIPELINES
6.1 MOTIVATION AND OVERVIEW
Function pipelining (a.k.a. system-level pipelining) is an important and subtle
optimization. It aims to improve the quality of synthesized RTL implementations,
by allowing multiple successive transactions of a function to execute concurrently.
Most state-of-the-art behavioral synthesis tools, e.g., AutoESL, CatapultC, and
Cynthesizer, support function pipelining. However, it is a complex transformation
and consequently error-prone. An error in the transformation can manifest itself
as subtle bugs in scheduling, binding, or generation of controlling FSM for the
synthesized design. For instance, incorrect scheduling can cause two operations to
mistakenly overlap in the same clock cycle and data hazards may be introduced
by incorrect pipeline forwarding. Thus, sequential equivalence checking support
for verifying correctness of the synthesized pipelines is critical in enabling wide
adoption of behavioral synthesis.
Function pipelining introduces overlapping execution, which leads to a significant
difference between the behavioral specification and the RTL. Furthermore,
typical bugs are not likely to be exposed by feeding the pipeline with one transaction:
subtle corner cases typically involve the overlapped execution of multiple
transactions at different pipeline stages. Therefore, SEC for function pipelines
must account for all possible input sequences. Particularly, it must account for
insertion of arbitrary "bubbles", i.e., pipeline stalls between the input sequences.
A brute-force SEC approach directly comparing the input/output relations of the
    int pipe(int in1, int in2)
    {
        static int a = 2;
        int b, c;
        b = a * in1;
        a = b + in2;
        c = a * b;
        return c;
    }

(a) C Source Code of a Function

    %a_1 = load @a
    %b = mul %a_1 %in1
    %tmp = add %b %in2
    store %tmp @a
    %res = mul %tmp %b
    ret %res

(b) Corresponding CCDFG

Figure 6.1: Example of function pipeline
behavioral specification and the synthesized RTL does not scale.
We present an approach to certifying the synthesized function pipelines. Our
approach is to break the certification into two steps: (1) we develop a reference
function pipelining transformation, which takes certain pipeline parameters from
behavioral synthesis to generate a pipeline reference model; (2) we check the
equivalence between the reference model and the RTL implementation. Our approach
efficiently reduces the complexity brought by bubbles in pipelines. The
mapping between behavioral level operations and RTL functional units is still
preserved; therefore, some key optimizations, such as cutpoints [30], are applicable.
We demonstrate the efficiency and scalability of our approach on a set of
industrial-strength designs synthesized by AutoESL.
[Figure 6.2: Difference between un-pipelined version and pipelined version. (a) Without function pipelining, the MUL, ADD, MUL sequences of successive transactions run back to back, 3 cycles apart; (b) with function pipelining, the MUL, ADD, MUL sequences overlap and can start 1 cycle apart.]
6.2 CHALLENGES WITH FUNCTION PIPELINING
Function pipelining improves the throughput of the synthesized circuit design by
allowing operations from consecutive transactions of a function to execute concur-
rently. The pipelined function can accept a new input before the previous ones
complete. Figure 6.2 compares the execution between the un-pipelined version and
a pipelined version of the design shown in Figure 6.1. There are three operations:
two multiplies and one add. Without function pipelining, the circuit can accept a
new input every three clock cycles. However, with function pipelining it can accept
a new input every clock cycle: thus, function pipelining can dramatically improve
the circuit throughput. However, the resource usage may increase, since the two
multiply operations cannot share the same multiplier now.
Behavioral synthesis generates handshake signals to implement the synchro-
nization between the synthesized pipeline and its surrounding circuits [60]: these
signals include start, done and allow, as shown in Figure 6.3. The start signal
indicates that there are valid inputs ready for execution in the pipeline and the
[Figure 6.3: Hardware interface. The FSM of the synthesized pipeline with its start, allow, and done handshake signals.]
allow signal indicates that the pipeline is ready to start a new transaction in the
next clock cycle. The handshake happens when both start and allow are high. The
done signal indicates that the pipeline produced some valid output data. However,
when the pipeline is ready to accept a new input (i.e., when allow is high), the
upstream circuit may not be able to get the new input data ready (i.e., start is
low); in this case, a bubble is inserted into the pipeline. For instance, consider the
pipeline shown in Figure 6.2 (b). The pipeline can start a new transaction every
cycle. However, since there is no input at the third cycle and a new input comes at
the fourth cycle instead, one bubble is inserted into the pipeline. When there are
bubbles in the pipeline, the pipeline typically disables the corresponding functional
units to save power. Correctly disabling the idle functional unit without affecting
the rest of the pipeline is challenging and error-prone. Therefore, equivalence
checking of function pipelines must carefully take bubbles into account.
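To make the role of bubbles concrete, the following C sketch models the function of Figure 6.1 as a three-stage pipeline (MUL, ADD, MUL) that can accept one transaction per cycle, and compares it with sequential calls for every bubble pattern over a few cycles. It is a behavioral illustration under stated assumptions, not synthesized RTL and not the reference model of this chapter: unsigned arithmetic replaces int so that wrap-around is well defined, the start array stands for the start/allow handshake, and the names pipe_ref, Slot, run_pipeline, and all_bubble_patterns_match are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { MAX_CYCLES = 16 };

/* Sequential reference: the function of Figure 6.1 with its static
   variable `a` made an explicit state argument. */
static unsigned pipe_ref(unsigned *a, unsigned in1, unsigned in2) {
    unsigned b = *a * in1;
    *a = b + in2;
    return *a * b;
}

/* Pipeline register contents of one in-flight transaction. */
typedef struct { bool valid; unsigned in2, b, tmp; } Slot;

/* Cycle-level model. start[t] == false inserts a bubble at cycle t.
   Outputs are written to `out` in order; the count is returned. */
size_t run_pipeline(const bool *start, const unsigned *in1,
                    const unsigned *in2, size_t cycles, unsigned *out) {
    unsigned a = 2;                          /* static state of the function */
    Slot s1 = { false, 0, 0, 0 };            /* between stage 1 and stage 2 */
    Slot s2 = { false, 0, 0, 0 };            /* between stage 2 and stage 3 */
    size_t n = 0;
    for (size_t t = 0; t < cycles + 2; ++t) {    /* two extra cycles to drain */
        /* Oldest transaction first, so a later stage-1 read sees the
           value of `a` written by stage 2 in the same cycle (forwarding). */
        if (s2.valid) out[n++] = s2.tmp * s2.b;              /* stage 3: MUL */
        Slot next2 = s1;
        if (s1.valid) { next2.tmp = s1.b + s1.in2; a = next2.tmp; } /* ADD */
        Slot next1 = { false, 0, 0, 0 };
        if (t < cycles && start[t]) {                        /* stage 1: MUL */
            next1.valid = true;
            next1.in2 = in2[t];
            next1.b = a * in1[t];
        }
        s2 = next2;
        s1 = next1;
    }
    return n;
}

/* Compares the pipeline with the reference for all 2^cycles bubble patterns. */
bool all_bubble_patterns_match(size_t cycles) {
    assert(cycles <= MAX_CYCLES);
    unsigned in1[MAX_CYCLES], in2[MAX_CYCLES];
    for (size_t t = 0; t < cycles; ++t) {
        in1[t] = (unsigned)t + 3u;
        in2[t] = 2u * (unsigned)t + 1u;
    }
    for (unsigned long mask = 0; mask < (1ul << cycles); ++mask) {
        bool start[MAX_CYCLES];
        unsigned expect[MAX_CYCLES], got[MAX_CYCLES];
        unsigned a = 2;
        size_t m = 0;
        for (size_t t = 0; t < cycles; ++t) {
            start[t] = ((mask >> t) & 1ul) != 0;
            if (start[t]) expect[m++] = pipe_ref(&a, in1[t], in2[t]);
        }
        size_t n = run_pipeline(start, in1, in2, cycles, got);
        if (n != m) return false;
        for (size_t k = 0; k < m; ++k)
            if (got[k] != expect[k]) return false;
    }
    return true;
}
```

The enumeration in all_bubble_patterns_match doubles with every additional cycle, which hints at why handling each bubble scenario separately does not scale.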
In addition to bubbles, the complexity of SEC in function pipelining comes from
overlapping execution of multiple transactions. It leads to a significant difference
in the schedule of operations between the CCDFG of the sequential design and
the RTL implementing the function pipeline. Furthermore, overlapping execution
leads to a different FSM. Function pipelines may have fewer states, but each state
executes more operations. Thus, standard SEC techniques are not effective.
Our top-level approach for function pipeline verification is analogous to the
previous SEC approach for loop pipelines [29], viz., developing a reference model for
the pipelined CCDFG that is semantically equivalent to the sequential design and
can be used for SEC with the RTL. However, the reference model generation for
function pipelines is inherently different from that for loops and involves subtle
challenges not encountered in loop pipelines, leading to drastically different algorithms. A
major difference between loop and function pipelines is that function pipelines
must account for arbitrary bubbles due to non-determinism in function invocation
latency, but loop pipelines do not. In loop pipelines, an FSM controls when to
start a new loop iteration. The FSM is part of the synthesized RTL, and this fixes
the execution of loop iterations. For function pipelines, starting a new transaction
is determined by upstream circuits, and is a runtime decision. To fully certify the
pipeline, all bubble insertion scenarios must be accounted for. A naive approach is
to build one pipelined CCDFG for each such scenario; to cover all these
scenarios, we would have to construct many pipelined CCDFGs. Figure 6.4 (a) shows
the CCDFG before pipelining, which has three scheduling steps. Figure 6.4 (b),
(c), (d), (e) show the pipelined CCDFGs with different numbers of bubbles inserted
in different stages. To certify a function pipeline, we would have to apply SEC
between all possible pipelined CCDFGs and the synthesized RTL implementation.
Unfortunately, the number of such pipelined CCDFGs is exponential in the number
of scheduling steps of the CCDFG before pipelining, making this approach
impractical.
6.3 SEC FOR FUNCTION PIPELINING
Our approach entails building a pipelined reference model, while still avoiding the
exponential cost due to bubble insertion mentioned above.
[Figure 6.4: Pipelined CCDFGs for different bubble insertion scenarios. Panels: (a) CCDFG before pipelining; (b) pipelined CCDFG without bubbles; (c) pipelined CCDFG with one bubble in the second transaction; (d) pipelined CCDFG with two bubbles in the second transaction; (e) pipelined CCDFG with one bubble in the third transaction.]
Our function pipelining transformation algorithm takes a CCDFG G before pipelining
and certain pipeline parameters, and generates a function-pipelined CCDFG G'.
Checking the equivalence between CCDFG G and the RTL is thereby reduced to
checking the equivalence between the pipelined CCDFG G' and the RTL. CCDFG G' allows
operations to execute in parallel, closely corresponding with the RTL through
careful modeling of bubble insertion. Thus, we can leverage the existing SEC
approach to check CCDFG G' against the RTL.
We focus on pipelines that satisfy the following requirements:
• All sub-functions have been fully inlined.
• All loops have been fully unrolled.
• There are no global variables (other than static variables).
Our framework actually supports loops and sub-functions by extending the approach
discussed here with compositional reasoning; we do not discuss that extension
in the dissertation. Global variables can be avoided by explicitly rewriting them as
static variables plus corresponding interfaces.
6.3.1 Algorithm to build Reference Model
As a pedagogical simplification, assume first that there are no branches among scheduling
steps, while allowing branches inside scheduling steps. Note that if CCDFG G has
such branches, we can merge the destination scheduling steps into a single scheduling
step; the branch between scheduling steps is thus equivalently converted into
a branch inside a scheduling step. Without branches between steps, we can view
CCDFG G as a sequence of scheduling steps from the entry step to the exit step.
Task interval is an important metric for the performance of function
pipelines: it is the number of clock cycles that must elapse between two
transactions. We can partition CCDFG G into a sequence of sub-CCDFGs according
to the task interval I. Each sub-CCDFG is called a pipeline unit, which is defined in
Definition 5. All scheduling steps within one pipeline unit execute sequentially,
and different pipeline units can execute in parallel. In the example shown in
Figure 6.2 (b), because the pipeline can start a new transaction every clock cycle,
each scheduling step is a pipeline unit.
Definition 5 (Pipeline Unit). Given a pipeline task interval $I$ and a CCDFG
$G \triangleq \langle G_{CD}, M, T \rangle$, $T$ can be partitioned into a set of sub-schedules
$\{T_0, T_1, \ldots, T_n\}$. Each $T_i$ takes $I$ clock cycles (except possibly the last
sub-schedule $T_n$, which may take fewer than $I$). Accordingly, $G$ can be partitioned
into a set of sub-CCDFGs $\{G_0, G_1, \ldots, G_n\}$. Each
$G_i \triangleq \langle G_{CD}, M, T_i \rangle$ is called a pipeline unit.
Algorithm 6 GeneratePipeUnits (G = ⟨S, E, M⟩, I, N)
1: P ← ∅; i ← 0
2: while i < N do
3:   S' ← ∅; pos ← 0
4:   while pos < I and i + pos < N do
5:     s ← S[i + pos]
6:     S' ← S' ∪ s; pos ← pos + 1
7:   end while
8:   p ← buildPipelineUnit(S', E, M)
9:   P ← P ∪ p; i ← i + I
10: end while
11: return P
Algorithm 6 describes the process of partitioning CCDFG G into a set of
pipeline units P. Here, G is described as the triple $\langle S, E, M \rangle$, where S and E
denote the nodes and edges of the projection of G on T, and M is the set of
microstep partitions refined by T. I is the task interval, and N is the number of
scheduling steps in G, which is the same as the pipeline's latency. We use $s_k = S[k]$ to
represent the k-th scheduling step in CCDFG G. The algorithm works by traversing
G from the entry step $s_0$, and creating one pipeline unit p for each group of I
consecutive scheduling steps. Lines 4-7 implement the process of collecting scheduling
steps for one pipeline unit. Subroutine buildPipelineUnit creates a pipeline unit
p; this process proceeds until we finish the traversal.
[Figure 6.5: Input and output CCDFGs of function pipelining transformation]
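The partitioning performed by Algorithm 6 can be sketched in a few lines of Python. This is an illustrative sketch, not the dissertation's implementation: scheduling steps are modeled as plain strings, the edge set E and the microstep partitions M are omitted, and the function name `generate_pipe_units` is an assumption made for this example.

```python
def generate_pipe_units(steps, interval):
    """Split an ordered list of scheduling steps into pipeline units.

    Each unit holds `interval` consecutive steps; the last unit may be
    shorter when the number of steps is not a multiple of the interval.
    """
    if interval < 1:
        raise ValueError("task interval must be at least 1")
    units = []
    i = 0
    while i < len(steps):
        # collect up to `interval` consecutive steps starting at index i
        units.append(steps[i:i + interval])
        i += interval
    return units


# Three scheduling steps, as in the example of Figure 6.4 (a).
units_interval_1 = generate_pipe_units(["S1", "S2", "S3"], 1)
units_interval_2 = generate_pipe_units(["S1", "S2", "S3"], 2)
```

With a task interval of 1, every scheduling step becomes its own pipeline unit; with a task interval of 2, the last unit is shorter than the others.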
Algorithm 7 shows the sequence of high-level steps involved in generating
the pipelined reference model. It takes CCDFG G, task interval I, and the
number of scheduling steps N. It involves five steps, viz., (1) inserting pipeline
registers, (2) constructing new scheduling steps, (3) generating new control edges,
(4) restricting control and data flow through guard variables, and (5) implementing
data forwarding. We describe these steps in detail below. Figure 6.5 illustrates
the result of applying these steps to our simple example of Figure 6.4(a). The
CCDFG on the left is the one before pipelining, the CCDFG on the right is the
generated pipelined reference model, and the figure in the middle shows how the
scheduling steps of the pipeline correspond to those in the original.
Algorithm 7 buildPipeline (G = ⟨S, E, M⟩, I, N)
1: /* first, generate pipeline registers */
2: ⟨S'_1, M'_1⟩ ← GenerateFuncPipelineRegs(S, M, E, I)
3: /* second, build pipelined scheduling steps */
4: S'_2 ← GenerateFuncSchedulingSteps(S'_1, I, N)
5: /* third, generate new control graph edges */
6: E'_1 ← GenerateFuncEdges(S'_2, E, I, N)
7: /* fourth, insert pipeline guard variables */
8: ⟨S'_3, M'_2, E'_2⟩ ← GenerateGuardCond(S'_2, M'_1, E'_1, I)
9: /* fifth, generate forwarding */
10: ⟨S'_4, M'_2⟩ ← GenerateFuncForwarding(S'_3, M'_2, E'_2, I)
11: return G' = ⟨S'_4, E'_2, M'_2⟩
Inserting Pipeline Registers. Since the pipeline can accept new inputs before
the previous transaction finishes, it may need extra registers to store intermediate
values and prevent variables from being overwritten. Algorithm 8 describes how
pipeline registers are introduced in the pipelined CCDFG. The basic idea is to
insert temporary variables to mimic pipeline registers. Subroutine getAllVars
returns all variables in CCDFG G. Subroutine needPipelineReg checks whether
a variable $v$ needs pipeline registers by comparing its lifetime $l_v$ with $I$. The
lifetime of a variable is the distance between its producer $ms_p$ and its last consumer
$ms_c$. If $l_v$ is greater than $I$, pipeline registers for $v$ are required; otherwise, they are
not necessary. Equivalently, if $ms_p$ and $ms_c$ belong to two different pipeline units
that are not consecutive, a pipeline register is required. Subroutine
addPipelineReg creates pipeline register variables and propagates the value from
$ms_p$ to $ms_c$ along the pipeline register variables. The number of pipeline registers
required is determined by how many pipeline units exist between $ms_p$ and $ms_c$.
Figure 6.6 shows a pipeline register inserted for variable b. The producer is in the
first scheduling step, and the consumer is in the third step; a pipeline register is
required in the middle step to prevent b from being overwritten after pipelining.
[Figure 6.6: Generate pipeline registers. The example has scheduling steps S1, S2, and S3 with the operations %a_1 = load @ a; %b = mul %a_1 %in1; %tmp = add %b %in2; %pipereg = %b; store %tmp @ a; %res = mul %tmp %pipereg; ret %res.]
Algorithm 8 GenerateFuncPipelineRegs (S, M, E, I)
1: V ← getAllVars(M); S' ← S; M' ← M
2: for each variable v in V do
3:   ms_p ← getProducer(v)
4:   ms_c ← getLastConsumer(v)
5:   if needPipelineReg(ms_p, ms_c, I) then
6:     ⟨S', M'⟩ ← addPipelineReg(S', M', E, ms_p, ms_c)
7:   end if
8: end for
9: return ⟨S', M'⟩
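The lifetime test behind needPipelineReg can be illustrated with a small Python sketch. It rests on assumptions that are not in the text: lifetimes are measured in scheduling steps, steps are identified by 0-based indices, and the helper names `unit_of`, `needs_pipeline_reg`, and `num_pipeline_regs` are hypothetical. The sketch is exercised only for a task interval of 1, as in Figure 6.6.

```python
def unit_of(step_index, interval):
    """Index of the pipeline unit that contains a 0-based scheduling step."""
    return step_index // interval


def needs_pipeline_reg(producer_step, last_consumer_step, interval):
    """A variable needs pipeline registers when its lifetime exceeds I."""
    lifetime = last_consumer_step - producer_step
    return lifetime > interval


def num_pipeline_regs(producer_step, last_consumer_step, interval):
    """Count the pipeline units strictly between producer and last consumer."""
    gap = unit_of(last_consumer_step, interval) - unit_of(producer_step, interval)
    return max(gap - 1, 0)


# Figure 6.6 with I = 1: %b is produced in S1 (index 0) and last used in S3
# (index 2); %tmp is produced in S2 (index 1) and last used in S3 (index 2).
b_needs_reg = needs_pipeline_reg(0, 2, 1)
b_reg_count = num_pipeline_regs(0, 2, 1)
tmp_needs_reg = needs_pipeline_reg(1, 2, 1)
```

For %b the lifetime of two steps exceeds the interval, so one register is placed in the middle step; %tmp is consumed in the next step and needs none.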
Constructing Scheduling Steps. In the pipelined CCDFG G', a scheduling step
s' consists of multiple scheduling steps of CCDFG G. All steps in s' can execute
and finish within one clock cycle. A key step in constructing scheduling steps for
the pipelined CCDFG is to correctly group scheduling steps from CCDFG G. The
grouping result, according to the pipeline parameters provided by behavioral
synthesis, should match the behavior of the synthesized pipeline. To achieve this, for the
i-th scheduling step s' in G', we collect the i-th scheduling step from all pipeline
units. Scheduling step s' must then maintain the following two properties: (1) Let
α and β be any two scheduling steps in G collected to execute in s'; then α and
β must belong to different pipeline units. (2) Every pipeline unit (except possibly
the last) must have some scheduling step in s'. Algorithm 9 shows our approach to
constructing scheduling steps. Subroutine getPipeUnits returns the pipeline units
generated by GeneratePipeUnits (Algorithm 6). Lines 3-11 collect scheduling steps from pipeline units.
We then generate control/data edges between those steps within scheduling step s', as
shown in lines 13-18. The edges go from left to right, because the scheduling steps on
the left run the transactions that entered the pipeline earlier. Subroutine buildEdge
creates the edge between two scheduling steps, and subroutine appendEdge
creates a new scheduling step that includes edge e'. Figure 6.7 shows the pipelined
scheduling steps for the simple example. In this example, I equals one;
therefore there is only one scheduling step in G', and this scheduling step consists of all
scheduling steps before pipelining.
Algorithm 9 GenerateFuncSchedulingSteps (S, I, N)
1: P ← getPipeUnits(); S' ← ∅
2: /* collect scheduling steps from pipeline units */
3: for i ← 0 to I − 1 do
4:   s'_i ← ∅
5:   for each p in P do
6:     if length(p) > i then
7:       s'_i ← s'_i ∪ p[i]
8:     end if
9:   end for
10:   S' ← S' ∪ s'_i
11: end for
12: /* build new edges within one single scheduling step */
13: for each step s' in S' do
14:   for each consecutive step pair (s'[k], s'[k + 1]) in s' do
15:     e' ← buildEdge(s'[k], s'[k + 1])
16:     s' ← appendEdge(s', e')
17:   end for
18: end for
19: return S'
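The grouping step of Algorithm 9 can be sketched as follows. This is a minimal illustration under stated assumptions: pipeline units are lists of step names, later units are listed first so that transactions which entered the pipeline earlier appear on the left (as in Figure 6.7), and the function names `group_pipelined_steps` and `intra_step_edges` are hypothetical.

```python
def group_pipelined_steps(units, interval):
    """Build the pipelined scheduling steps from the pipeline units.

    The i-th pipelined step collects the i-th original step of every unit
    that is long enough. Later units come first because they serve the
    transactions that entered the pipeline earlier.
    """
    pipelined = []
    for i in range(interval):
        group = [unit[i] for unit in reversed(units) if len(unit) > i]
        pipelined.append(group)
    return pipelined


def intra_step_edges(group):
    """Left-to-right edges between consecutive steps of one pipelined step."""
    return list(zip(group, group[1:]))


# Task interval 1: all three original steps end up in one pipelined step.
steps_interval_1 = group_pipelined_steps([["S1"], ["S2"], ["S3"]], 1)
edges_interval_1 = intra_step_edges(steps_interval_1[0])
# Task interval 2: the last pipeline unit is shorter than the first.
steps_interval_2 = group_pipelined_steps([["S1", "S2"], ["S3"]], 2)
```

Both properties stated above hold by construction in this sketch: each pipelined step takes at most one original step per unit, and every unit except possibly the last contributes to every pipelined step.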
Building Edges. Algorithm 10 shows the construction of edges governing the
control flow of the pipelined CCDFG. Lines 3-6 show the construction of edges
between scheduling steps of the pipelined CCDFG G'. In addition, a back edge is
generated from the last scheduling step to the first scheduling step, so that the
pipelined CCDFG G' forms a loop. Figure 6.7 shows the edges between scheduling
steps in CCDFG G'.
[Figure 6.7: Construction of scheduling steps and edges]
Algorithm 10 GenerateFuncEdges (S', E, I, N)
1: E' ← ∅
2: /* build the edges between new scheduling steps */
3: for each consecutive step pair (S'[i], S'[i + 1]) in S' do
4:   e' ← buildEdge(S'[i], S'[i + 1])
5:   E' ← E' ∪ e'
6: end for
7: /* build the back edge */
8: s_src ← S'[I − 1]; s_dst ← S'[0]
9: e_backedge ← buildEdge(s_src, s_dst)
10: E' ← E' ∪ e_backedge
11: return E'
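As a rough illustration of Algorithm 10, the control edges of the pipelined CCDFG can be modeled as index pairs. The representation (edges as tuples of 0-based step indices) and the name `generate_func_edges` are assumptions made for this sketch.

```python
def generate_func_edges(num_pipelined_steps):
    """Control edges of the pipelined CCDFG as (source, destination) pairs.

    Consecutive pipelined steps are chained, and a back edge from the last
    step to the first one closes the loop.
    """
    if num_pipelined_steps < 1:
        raise ValueError("the pipelined CCDFG needs at least one step")
    edges = [(i, i + 1) for i in range(num_pipelined_steps - 1)]
    edges.append((num_pipelined_steps - 1, 0))  # back edge
    return edges


edges_three_steps = generate_func_edges(3)
edges_one_step = generate_func_edges(1)
```

When the task interval is 1 there is a single pipelined step, so the back edge is a self-loop on that step.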
Generating Guard Variables. Note that the pipelined CCDFG G' must be a
loop since it must be able to initiate an arbitrary number of function invocations as
determined by the upstream logic. Guard variables guarantee that the execution
of this loop corresponding to each function invocation follows the control flow of
the original CCDFG G and terminates properly. Algorithm 11 describes the details
of guard variable insertion, and Figure 6.8 illustrates it with a simple example.
First, subroutine createGuardVariable creates guard variables $c_1, c_2, \ldots, c_n$ for all
pipeline units. For each scheduling step s in CCDFG G, subroutine insertGuard
inserts a branch operation before entering it. If the guard variable is true, the
scheduling step is enabled; otherwise it is skipped. After executing one pipeline
unit, we propagate the value of its guard variable to its successor in the sequence.
Recall from above that G' is a loop; one loop iteration executes and finishes all
pipeline units simultaneously. The assignment of the first guard variable $c_1$ depends
on the start signal. Guard variables are propagated right before the back edge.
The assignment and propagation of guard variables are constructed by subroutine
genVarAssign. Figure 6.8 shows the guard variables in an example. We need to
handle the exit of the pipelined CCDFG specially. CCDFG G' can only exit when $c_n$
is true and $c_1, \ldots, c_{n-1}$ are all false, which indicates that the pipeline has only one last
transaction running and this transaction is about to exit. In the pipelined CCDFG,
we refine the semantics of the ret operation. Operation ret marks the end of one
transaction instead of the whole CCDFG, and generates an output if it does not return
void. We introduce a new operation exit to denote the termination of the
pipelined CCDFG, which is gated by the guard variables. Subroutine insertExitOp
inserts this gated exit operation.
[Figure 6.8: Insert guard variables and assignment. Labels recoverable from the figure: steps S3, S2, S1 guarded by c3, c2, c1; Start selecting c1 = 1 or c1 = 0; assignments c3 = c2; c2 = c1; exit condition !c2 & !c1.]
Note that the pipelined CCDFG must permit overlapped execution of all the
pipeline stages. If start is asserted, the first guard variable $c_1$ is assigned true when
entering the loop in the pipelined CCDFG. During the execution of the first iteration
of the loop body, only those scheduling steps guarded by $c_1$ are enabled. The
other steps are skipped. Consider the example shown in Figure 6.8. In the first
loop iteration, $s_1$ is enabled and $s_2$ and $s_3$ are disabled. At the end of the first loop
iteration, guard variable $c_2$ receives the enable token propagated from $c_1$, while $c_3$ remains
false. In the second iteration, $c_1$ remains true, because there is a second start
request. Both $s_1$ and $s_2$ are enabled in the second iteration. This process proceeds
until all guard variables $c_1, \ldots, c_3$ are true and the pipeline enters the pipeline-full stage. In
the pipeline-full stage, whenever a new transaction starts, one earlier transaction finishes
at the same time. When the pipeline is full and there is no further start signal,
the pipeline starts to flush. Guard variable $c_1$ is assigned false because there is no
start. Thus, $s_2$ and $s_3$ are enabled and $s_1$ is disabled. The disable token is propagated
between guard variables in every loop iteration. If all guard variables are false
except the last one, $c_3$, the pipelined CCDFG finishes by executing the
exit operation. We can easily insert bubbles into the pipeline by toggling the start signal.
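The guard-variable behavior described above can be replayed with a short Python simulation. This is a sketch under assumptions that go beyond the text: the task interval is 1 (one scheduling step per pipeline unit), the start signal is given as a list with one entry per loop iteration, and the function name `simulate_guards` is hypothetical.

```python
def simulate_guards(start_signals, num_units):
    """Simulate guard variables c1..cn of a pipelined CCDFG (task interval 1).

    start_signals[t] tells whether the upstream logic starts a transaction
    in loop iteration t. Returns the enabled steps of each iteration and a
    flag telling whether the gated exit operation fired.
    """
    guards = [False] * num_units  # guards[k] models c(k+1)
    trace = []
    for start in start_signals:
        guards[0] = bool(start)  # c1 follows the start signal
        trace.append([k + 1 for k, on in enumerate(guards) if on])
        # exit: only the last guard is set, so the final transaction retires
        if guards[-1] and not any(guards[:-1]):
            return trace, True
        # propagate right before the back edge: cn = c(n-1), ..., c2 = c1
        for k in range(num_units - 1, 0, -1):
            guards[k] = guards[k - 1]
    return trace, False


# Fill the pipeline with three transactions, then flush it.
full_trace, full_exit = simulate_guards([1, 1, 1, 0, 0], 3)
# Skip the second start request, which inserts a bubble.
bubble_trace, bubble_exit = simulate_guards([1, 0, 1, 0, 0], 3)
```

In the first run the enabled steps grow from s1 to all three steps and then shrink during the flush; in the second run the third iteration enables only s1 and s3, which corresponds to the guard assignment c1 = true, c2 = false, c3 = true.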
Algorithm 11 GenerateGuardCond (S, E, M)
1: S' ← S; E' ← E; M' ← M
2: P ← getPipeUnits()
3: C ← createGuardVariable(P)
4: /* generate guard condition for each scheduling step */
5: for each pipeline unit p in P do
6:   for each scheduling step s in p do
7:     c ← getGuardVar(p)
8:     ⟨S', M', E'⟩ ← insertGuard(s, p, C, S', E', M')
9:   end for
10: end for
11: /* generate assignment for guard variables */
12: ⟨S', M', E'⟩ ← genVarAssign(C, S', M', E')
13: /* insert exit operation */
14: ⟨S', E'⟩ ← insertExitOp(ms_ret, C, S', E')
15: return ⟨S', M', E'⟩
Implementing Pipeline Forwarding. The last step in the construction is to
implement pipeline forwarding. In function pipelines, the dependencies between
transactions are introduced by global or static variables. Forwarding can be
implicitly implemented by mapping static/global variables to hardware registers
that form feedback paths. In a CCDFG, the operations that fetch data from or store data
to registers are represented by load and store, respectively. Algorithm 12 describes
the details of constructing forwarding for a pipelined CCDFG. Subroutine
findAllForwarding returns all pairs of operations that may need forwarding by
checking load and store pairs. However, in order to achieve the best performance,
behavioral synthesis may generate a combinational path to forward the data directly.
For instance, in the example shown in Figure 6.10, the output of the adder
is forwarded directly to the next transaction's multiplier without passing
through the register; otherwise the synthesized RTL could not accept new data every
clock cycle. To mimic this combinational path, we make sure there is a valid data
path from the forwarding source operation to the destination in the pipelined
CCDFG. The absence of such a path indicates a data hazard, and our check
reports an error. However, the forwarding path may vary depending on bubbles in
the pipeline. If the adder is disabled due to a bubble, the data forwarded to the
multiplier by the combinational path is invalid. Instead, the correct data should
come from the register. To handle this complication, we determine the forwarding
path by checking whether the source operation is enabled. This check can be
done by checking its guard variable. We insert a select operation to implement
the check. Figure 6.9 shows the details of the forwarding between the adder and the
multiplier.
[Figure 6.9: Final pipelined CCDFG. The forwarding path is implemented by %f = select %c2 %tmp %a_1, whose result feeds %b = mul %f %in1.]
[Figure 6.10: Waveform of pipeline forwarding (two rows of MUL, ADD, MUL operations linked by a forwarding path)]
Algorithm 12 GenerateFuncForwarding (S, M, E, I)
1: D_lc ← findAllForwarding(S, M, E)
2: S' ← S; M' ← M
3: for each pair (o_w, o_r) in D_lc do
4:   if checkForwarding(o_r, I, S') then
5:     ⟨S', M', E'⟩ ← insertSelect(o_w, o_r, S', E, M')
6:   else
7:     return ERROR
8:   end if
9: end for
10: return ⟨S', M'⟩
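The effect of the inserted select operation can be shown with a very small Python model. It mirrors the operation %f = select %c2 %tmp %a_1 from Figure 6.9; the Python function names `select` and `multiplier_operand` and the concrete values are assumptions chosen for illustration only.

```python
def select(cond, if_true, if_false):
    """Model of the CCDFG select operation."""
    return if_true if cond else if_false


def multiplier_operand(c2, adder_output, register_value):
    """Operand of the next transaction's multiplier.

    If the adder's guard c2 is set, the adder ran in this iteration and its
    output is forwarded; otherwise there is a bubble and the value is read
    from the register.
    """
    return select(c2, adder_output, register_value)


# Adder enabled: the freshly computed sum is forwarded.
forwarded = multiplier_operand(True, adder_output=7, register_value=3)
# Adder disabled by a bubble: the stored register value is used instead.
from_register = multiplier_operand(False, adder_output=7, register_value=3)
```

Tying the choice to the guard variable is what lets a single pipelined CCDFG cover both the bubble-free and the bubbled forwarding paths.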
6.3.2 SEC between CCDFGs and the RTL
Recall from Section 6.2 that handling bubbles is a major hurdle for certifying
pipelines. Bubbles in pipelines affect the behavior in two ways: (1) the idle operations are
disabled; (2) the pipeline forwarding takes different paths. Both effects are modeled
in the pipelined CCDFG by introducing guard variables. Recall, however, that in order
to fully check the behavior of pipelines with bubbles, we need to run SEC on all
combinations of input sequences.
We implement an approach that runs the check only once. We utilize our
guard variables to encode all possible input combinations. This introduces the
proof obligation that the execution of the transactions already in the pipeline does not
affect the execution of a new transaction. A new transaction can start at an arbitrary
state, with the pipeline full, empty, or containing bubbles. We model the
pipeline at different states by toggling guard variables. For instance, the execution
shown in Figure 6.4 (c) can be modeled by assigning $c_1$ = true, $c_2$ = false, and
$c_3$ = true.
Our SEC then has the following three steps:
• Set the pipelined CCDFG to a symbolic state that starts a new transaction.
Setting the pipeline to an arbitrary state is done by encoding guard
variables: we set $c_1$ to true and assign symbols to $c_2, \ldots, c_n$.
• Set the FSM of the RTL circuit to the same symbolic state as the CCDFG.
The structure of the circuit's FSM can be obtained from reports generated by
behavioral synthesis; we analyze these reports to determine the corresponding
symbolic states for the CCDFG and the RTL.
• Feed the same symbolic input data to the CCDFG and the RTL, then run
dual-rail symbolic simulation between them for a single transaction. The proof
obligation is that the outputs of the pipelined CCDFG and the RTL implementation
are equivalent.
Before running SEC, we assume that the CCDFG and the RTL are equivalent.
SEC then checks whether the equivalence is still maintained after executing one
transaction. The existing SEC approach can be applied directly. Furthermore, because
the mapping between the CCDFG's operations and the RTL's functional units is still
maintained, we can apply cutpoints to further improve scalability.
6.4 EXPERIMENTAL RESULTS
We have implemented the reference function pipelining transformation on top of
our framework for behavioral synthesis certification. SEC is implemented by a
cycle-by-cycle, dual-rail, word-level symbolic simulation between the CCDFG and the RTL,
with support for cutpoint optimization. We employ CVC3 as the SMT solver. We
have applied our tool to a collection of function-pipelined designs synthesized by
AutoESL. These designs were carefully selected from several different application
domains. For instance, TEA and XTEA are cryptography algorithms, which have
complex bitwise operations, such as shift and bitwise OR. The FIR filter is a signal
processing design with internal feedback; behavioral synthesis utilizes forwarding
to optimize this feedback path. All loops in these designs have been fully unrolled.
The experiments were conducted on a workstation with a 3 GHz Intel Xeon processor
and 2 GB of memory. We set the running time bound to two hours.
Table 6.1 presents our experimental results. We first conducted brute-force
SEC between the un-pipelined CCDFG and the RTL on all designs. None of
the runs terminated within the time bound. We then conducted brute-force SEC
between the pipelined CCDFG and the RTL. Only the run on FIR finished, while
the others timed out. With the cutpoint optimization applied, SEC succeeded on
all designs with modest time and memory usage. Column #Cuts shows the number
of cutpoints identified for each design. The experiments demonstrate that our function
pipelining transformation preserves the internal mapping between the CCDFG and
the RTL, which effectively enables cutpoints. Our approach has successfully verified
designs with function pipelines involving several thousand lines of synthesized RTL
with reasonable time and memory usage.
As discussed in Section 6.2, a naive solution to handle bubbles is to enumerate
all possible input combinations and build the corresponding CCDFGs. Unfortunately,
the number of such CCDFGs is exponential in the number of scheduling steps for a
fixed task interval. Given a CCDFG with $N$ scheduling steps and task interval $I$, there
exist $2^{N/I-1}$ pipelined CCDFGs. This naive approach is clearly impractical. For
instance, TEA has 43 scheduling steps and a task interval of 1, so we would have to build $2^{42}$
CCDFGs and run SEC that many times. In our approach, we encode
all possible CCDFGs into a single one by assigning symbols to guard variables.
We employ SMT solvers to symbolically explore all possible scenarios rather than
enumerating them explicitly. This dramatically improves efficiency and scalability.
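The count above, and the set of states that the symbolic guard encoding covers, can be made concrete with a short Python sketch. It assumes that the number of pipeline units is N/I rounded up when I does not divide N, and it enumerates guard assignments explicitly only to show what an SMT solver explores symbolically; the names `num_bubble_scenarios` and `guard_states` are hypothetical.

```python
from itertools import product
from math import ceil


def num_bubble_scenarios(num_steps, interval):
    """Number of pipelined CCDFGs the naive approach has to build."""
    units = ceil(num_steps / interval)  # assumption: round up partial units
    return 2 ** (units - 1)


def guard_states(num_units):
    """All guard assignments with c1 fixed to True (a new transaction starts).

    Each remaining guard c2..cn is free, which is what assigning a symbol to
    it expresses in the symbolic encoding.
    """
    return [(True,) + rest
            for rest in product([False, True], repeat=num_units - 1)]


tea_scenarios = num_bubble_scenarios(43, 1)  # TEA: 43 steps, interval 1
three_unit_states = guard_states(3)
```

For a three-unit pipeline there are four such states, including (True, False, True), which is the assignment used earlier to model Figure 6.4 (c).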
Table 6.1: Function pipelining experimental results
Design      RTL #lines   App. domain         #Cuts
FIR         430          Signal processing   13
DCT         941          Signal processing   32
CORDIC      1450         Data processing     73
XTEA        1777         Cryptography        32
TEA         2325         Cryptography        85
YUVTORGB    2412         Image processing    48
Memory Op   4106         Memory operation    75
[Further columns of Table 6.1: pipeline information (interval, depth), function information (operations, forwarding, pipeline registers), and memory (MB) and time (sec) without and with optimization.]
Chapter 7
CONCLUSION AND FUTURE WORK
7.1 SUMMARY OF CONTRIBUTIONS
Equivalence checking is highly desirable for providing confidence in the correctness
of behavioral synthesis. This dissertation research has developed a practical and
scalable SEC framework for certifying behavioral synthesis flows. This framework
successfully addresses three major challenges: (1) closing the significant semantic
gap between ESL and RTL, (2) scaling to industrial designs, and (3) verifying the correctness
of loop and function pipelines.
To address the above challenges, we have introduced the CCDFG as the intermediate
design representation, which preserves the design's control and data flow and
augments it with a schedule. Our SEC algorithm is based on word-level dual-rail
symbolic simulation for comparing the CCDFG and the RTL. Scalability has been
dramatically improved by employing SMT solvers as the decision engine. We have
developed three effective optimizations targeting different design features. The
optimizations exploit the high-level structure of the ESL description to further
ameliorate verification complexity. The experimental results have demonstrated
that our optimized SEC framework is capable of verifying designs with tens of
thousands of lines of RTL synthesized by a state-of-the-art behavioral synthesis tool.
We have also developed approaches to certifying behaviorally synthesized loop and
function pipelines. The crucial steps here are to develop reference pipelining
transformations for loop pipelining and function pipelining, so that we can apply SEC
between the reference model and the synthesized RTL. The key insight is that the
parameterized, synthesis-guided pipelining reference transformations on the CCDFG
permit comparison with the RTL even after mappings with the original sequential
specification have been destroyed by loop and function pipelining. The mapping
between behavioral-level operations and RTL functional units is still preserved;
therefore some key optimizations remain applicable. To reduce the complexity brought
by "bubbles", which are a specific difficulty in function pipelines, we have encoded
all possible bubble insertions into one single CCDFG. We can therefore employ
SMT solvers to symbolically explore all possible scenarios rather than enumerating
them explicitly. The experimental results have shown that our approaches are efficient
and scale to various synthesized pipelines from different application
domains.
7.2 FUTURE RESEARCH DIRECTIONS
This dissertation has developed an SEC framework for certifying behavioral
synthesis flows. However, there are many aspects of behavioral synthesis that have
not been explored. This section discusses several future research directions.
7.2.1 Hierarchical Function Pipelines
We have presented our approach to certifying function pipelines generated by
behavioral synthesis in Chapter 6. One assumption of our approach is that all
sub-functions have been fully inlined. However, inlining sub-functions brings two
major drawbacks: (1) redundant checks for sub-functions that are invoked multiple
times, and (2) an exponential increase in the complexity of SEC. An ideal solution
is to enable compositional reasoning. The modular analysis approach proposed in
Section 4.4 is not applicable to function pipelines. The reason is that we represent
the pre-certified sub-functions as uninterpreted functions, which are untimed.
However, in order to verify the overlapping execution of pipelined sub-functions,
we need timing information. Therefore, introducing a timed uninterpreted
function model is crucial for certifying hierarchical function pipelines.
7.2.2 Verification of Behaviorally Synthesized Interfaces
Behavioral synthesis can automatically translate high-level interfaces into RTL
interfaces that follow various communication protocols. For instance, consider an ESL
design specified in a synthesizable subset of C/C++ that has an output pointer
in its parameter list. This pointer can be synthesized into a First In, First Out
(FIFO) interface in the RTL. The FIFO interface needs to follow the corresponding
protocol; for example, the write signal can be asserted only while the FIFO is not full (the full signal
is low). Therefore, certification of the synthesized interfaces is also a key part of
proving the correctness of the synthesized RTL implementation.
7.2.3 SEC for Compiler Transformations in Behavioral Synthesis
We proposed an approach to certifying compiler transformations in behavioral
synthesis once and for all using theorem proving [59]. The cost of a monolithic proof
is mitigated by the reusability of the transformation over different designs. This
approach requires comprehensive knowledge of the transformation algorithms
employed by behavioral synthesis tools. However, these algorithms may not be available,
especially for commercial tools. An alternative solution is to utilize SEC to verify
the correctness of compiler transformations by checking the equivalence between
the input and output of each instance. This removes the need to know the internal
algorithms of behavioral synthesis. However, because the variable mappings may
not be preserved between input and output, developing effective optimizations
that make this approach scalable is quite challenging.
REFERENCES
[1] T. Ball and S. K. Rajamani. Automatically validating temporal safety properties
of interfaces. In Proceedings of the 8th international SPIN workshop on
Model checking of software, SPIN '01, pages 103-122, New York, NY, USA,
2001. Springer-Verlag New York, Inc.
[2] C. Barrett, R. Sebastiani, S. Seshia, and C. Tinelli. Satisfiability Modulo
Theories. IOS Press, 2009.
[3] C. Barrett and C. Tinelli. CVC3. In Proceedings of the 19th international
conference on Computer aided verification, CAV'07, pages 298-302, Berlin,
Heidelberg, 2007. Springer-Verlag.
[4] J. Baumgartner, H. Mony, V. Paruthi, R. Kanzelman, and G. Janssen.
Scalable sequential equivalence checking across arbitrary design transformations.
In Proceedings of the 2006 IEEE/ACM International Conference on
Computer-Aided Design. IEEE Computer Society, 2006.
[5] C. Berman and L. Trevillyan. Functional comparison of logic designs for
VLSI circuits. In International Conference on Computer-Aided Design, pages
456-459, 1989. USA.
[6] A. Biere, A. Cimatti, E. M. Clarke, and Y. Zhu. Symbolic model checking
without BDDs. In Proceedings of the 5th International Conference on Tools and
Algorithms for Construction and Analysis of Systems, pages 193-207, London,
UK, 1999. Springer-Verlag.
[7] D. Brand. Verification of large synthesized designs. In Proceedings of the
1993 IEEE/ACM international conference on Computer-aided design, pages
534-537, Los Alamitos, CA, USA, 1993. IEEE Computer Society Press.
[8] R. E. Bryant. Symbolic manipulation of Boolean functions using a graphical
representation. In Proceedings of the 22nd Design Automation Conference,
pages 688-694. IEEE Computer Society Press, 1985.
[9] R. E. Bryant. A methodology for hardware verification based on logic
simulation. Journal of the ACM, 38(2):299-328, 1991.
[10] J. R. Burch and D. L. Dill. Automatic verification of pipelined microprocessor
control. In Proceedings of the 6th International Conference on Computer Aided
Verification, CAV '94, pages 68-80, London, UK, 1994. Springer-Verlag.
[11] Cadence. C-to-Silicon Compiler User Guide, 2012.
[12] Calypto Design Systems Inc. http://www.calypto.com.
[13] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. Anderson,
S. Brown, and T. Czajkowski. LegUp: high-level synthesis for FPGA-based
processor/accelerator systems. In Proceedings of the 19th ACM/SIGDA
international symposium on Field programmable gate arrays, FPGA '11, pages
33-36, New York, NY, USA, 2011. ACM.
[14] R. O. Chapman. Verified high-level synthesis. PhD thesis, Ithaca, NY, USA,
1994.
[15] P. Chauhan, D. Goyal, G. Hasteer, A. Mathur, and N. Sharma. Non-cycle-accurate
sequential equivalence checking. In Proceedings of the 46th Annual
Design Automation Conference, pages 460-465, New York, NY, USA, 2009.
ACM.
[16] K.-T. Cheng and V. D. Agrawal. Unied Methods for VLSI Simulation and
Test Generation. Kluwer Academic Publishers, 1989.
[17] L. Claesen, M. Genoe, and E. Verlind. Implementation/specication verica-
tion by means of SFG-Tracing. In CHARME, 1993.
[18] E. Clarke, D. Kroening, and K. Yorav. Behavioral consistency of c and verilog
programs using bounded model checking. In Proceedings of the 40th annual
Design Automation Conference, pages 368{371, New York, NY, USA, 2003.
ACM.
[19] S. A. Cook. The complexity of theorem-proving procedures. In Proceedings
of the third annual ACM symposium on Theory of computing, pages 151{158,
New York, NY, USA, 1971. ACM.
[20] B. Dutertre and L. de Moura. A fast linear-arithmetic solver for dpll(t). In
Proceedings of the 18th international conference on Computer Aided Verica-
tion, pages 81{94, Berlin, Heidelberg, 2006. Springer-Verlag.
[21] X. Feng, A. J. Hu, and J. Yang. Partitioned model checking from software
specifications. In Proceedings of the 2005 Asia and South Pacific Design
Automation Conference, ASP-DAC '05, pages 583–587, New York, NY, USA,
2005. ACM.
[22] R. Floyd. Assigning Meanings to Programs. In Mathematical Aspects of
Computer Science, Proc. of Symposia in Applied Mathematics, 1967.
[23] Forte Design Systems. Cynthesizer Manual, 2012.
[24] M. Fujita, H. Fujisawa, and N. Kawato. Evaluations and improvements of a
Boolean comparison program based on binary decision diagrams. In
Proceedings of the 1988 IEEE/ACM International Conference on Computer-Aided
Design, pages 2–5. IEEE Computer Society Press, 1988.
[25] D. Gajski, N. D. Dutt, A. Wu, and S. Lin. High Level Synthesis: Introduction
to Chip and System Design. Kluwer Academic Publishers, 1993.
[26] V. Ganesh and D. L. Dill. A decision procedure for bit-vectors and arrays. In
Proceedings of the 19th international conference on Computer aided
verification, pages 519–531, Berlin, Heidelberg, 2007. Springer-Verlag.
[27] E. Giunchiglia and A. Tacchella, editors. Theory and Applications of
Satisfiability Testing, volume 2919. Springer, 2004.
[28] M. Gordon, J. Iyoda, S. Owens, and K. Slind. Automatic formal synthesis
of hardware from higher order logic. Electron. Notes Theor. Comput. Sci.,
145:27–43, Jan. 2006.
[29] K. Hao, S. Ray, and F. Xie. Equivalence checking for behaviorally synthesized
pipelines. In Proceedings of the 49th Annual Design Automation Conference,
pages 344–349, New York, NY, USA, 2012. ACM.
[30] K. Hao, F. Xie, S. Ray, and J. Yang. Optimizing equivalence checking for
behavioral synthesis. In Proceedings of the Conference on Design, Automation
and Test in Europe, pages 1500–1505, Leuven, Belgium, 2010.
European Design and Automation Association.
[31] C. A. R. Hoare. An axiomatic basis for computer programming. Commun. ACM,
12(10):576–580, Oct. 1969.
[32] C. A. R. Hoare. Communicating Sequential Processes. Prentice-Hall, 1985.
[33] A. J. Hu. High-level vs. RTL combinational equivalence: An introduction. In
Proceedings of the 2006 IEEE/ACM International Conference on
Computer-Aided Design, pages 274–279. IEEE Computer Society, 2006.
[34] M. Huth and M. Ryan. Logic in Computer Science. Cambridge University
Press, 2006.
[35] S.-W. Jeong, B. Plessier, G. D. Hachtel, and F. Somenzi. Variable
ordering and selection for FSM traversal. In Proceedings of the 1991 IEEE/ACM
International Conference on Computer-Aided Design, pages 476–479. IEEE
Computer Society Press, 1991.
[36] R. Johnson and K. Pingali. Dependence-based program analysis. In
Proceedings of the ACM SIGPLAN 1993 conference on Programming language
design and implementation, PLDI '93, pages 78–89, New York, NY, USA,
1993. ACM.
[37] R. B. Jones, D. L. Dill, and J. R. Burch. Efficient validity checking for
processor verification. In Proceedings of the 1995 IEEE/ACM international
conference on Computer-aided design, ICCAD '95, pages 2–6, Washington, DC,
USA, 1995. IEEE Computer Society.
[38] D. Kaiss, S. Goldenberg, Z. Hanna, and Z. Khasidashvili. Seqver: A
sequential equivalence verifier for hardware designs. In Proceedings of the 2006
IEEE/ACM International Conference on Computer-Aided Design, pages
267–273. IEEE Computer Society, 2006.
[39] M. Kaufmann, P. Manolios, and J. S. Moore. Computer-Aided Reasoning: An
Approach. Kluwer Academic Publishers, Boston, MA, June 2000.
[40] H. Kautz and B. Selman. Planning as satisfiability. In Proceedings of the 10th
European conference on Artificial intelligence, pages 359–363, New York, NY,
USA, 1992. John Wiley & Sons, Inc.
[41] A. Koelbl, J. R. Burch, and C. Pixley. Memory modeling in ESL-RTL equivalence
checking. In Proceedings of the 44th annual Design Automation Conference,
pages 205–209, New York, NY, USA, 2007. ACM.
[42] A. Koelbl, R. Jacoby, H. Jain, and C. Pixley. Solver technology for
system-level to RTL equivalence checking. In Proceedings of the Conference on Design,
Automation and Test in Europe, DATE '09, pages 196–201, Leuven,
Belgium, 2009. European Design and Automation Association.
[43] A. Koelbl and C. Pixley. Constructing efficient formal models from high-level
descriptions using symbolic simulation. Int. J. Parallel Program.,
33(6):645–666, Dec. 2005.
[44] D. Kroening and E. Clarke. Checking consistency of C and Verilog using
predicate abstraction and induction. In Proceedings of the 2004 IEEE/ACM
International conference on Computer-aided design, pages 66–72, Washington,
DC, USA, 2004. IEEE Computer Society.
[45] A. Kuehlmann and F. Krohm. Equivalence checking using cuts and heaps.
In Proceedings of the 34th annual Design Automation Conference, DAC '97,
pages 263–268, New York, NY, USA, 1997. ACM.
[46] S. Kundu, S. Lerner, and R. Gupta. Validating high-level synthesis. In
Proceedings of the 20th international conference on Computer Aided Verification,
CAV '08, pages 459–472, Berlin, Heidelberg, 2008. Springer-Verlag.
[47] C. Y. Lee. Binary decision programs. Bell System Technical Journal,
38(4):985–999, 1959.
[48] X. Leroy. Formal verification of a realistic compiler. Commun. ACM,
52(7):107–115, July 2009.
[49] J. Levitt and K. Olukotun. A scalable formal verification methodology for
pipelined microprocessors. In Proceedings of the 33rd annual Design
Automation Conference, DAC '96, pages 558–563, New York, NY, USA, 1996. ACM.
[50] LLVM Project. The LLVM Compiler Infrastructure. http://llvm.org.
[51] S. Malik, A. Wang, R. K. Brayton, and A. Sangiovanni-Vincentelli. Logic
verification using binary decision diagrams in a logic synthesis environment. In
Proceedings of the 1988 IEEE/ACM International Conference on
Computer-Aided Design, pages 6–9. IEEE Computer Society Press, 1988.
[52] P. Manolios, S. K. Srinivasan, and D. Vroon. Automatic memory reductions
for RTL model verification. In Proceedings of the 2006 IEEE/ACM international
conference on Computer-aided design, pages 786–793, New York, NY, USA,
2006. ACM.
[53] A. Mathur, M. Fujita, E. Clarke, and P. Urard. Functional equivalence
verification tools in high-level synthesis flows. IEEE Des. Test, 26(4):88–95, July
2009.
[54] Mentor Graphics. Catapult C Reference Manual, 2011.
[55] M. W. Moskewicz, C. F. Madigan, Y. Zhao, L. Zhang, and S. Malik. Chaff:
engineering an efficient SAT solver. In Proceedings of the 38th annual Design
Automation Conference, pages 530–535, New York, NY, USA, 2001. ACM.
[56] G.-J. Nam, K. A. Sakallah, and R. A. Rutenbar. Satisfiability-based layout
revisited: detailed routing of complex FPGAs via search-based Boolean SAT. In
Proceedings of the 1999 ACM/SIGDA seventh international symposium on
Field programmable gate arrays, pages 167–175, New York, NY, USA, 1999.
ACM.
[57] G. Nelson and D. C. Oppen. Simplification by cooperating decision
procedures. ACM Trans. Program. Lang. Syst., 1(2):245–257, Oct. 1979.
[58] OCaml. http://caml.inria.fr.
[59] S. Ray, K. Hao, Y. Chen, F. Xie, and J. Yang. Formal verification for
high-assurance behavioral synthesis. In Proceedings of the 7th International
Symposium on Automated Technology for Verification and Analysis, ATVA '09,
pages 337–351, Berlin, Heidelberg, 2009. Springer-Verlag.
[60] P. R. Schaumont. A Practical Introduction to Hardware/Software Codesign.
Springer, 2010.
[61] K. Schneider. A verified hardware synthesis of Esterel programs. In Proceedings
of the International Workshop on Distributed and Parallel Embedded Systems:
Architecture and Design of Distributed Embedded Systems, DIPES '00, pages
205–214, Deventer, The Netherlands, 2001. Kluwer, B.V.
[62] C. J. Seger, R. B. Jones, J. W. O'Leary, T. Melham, M. D. Aagaard,
C. Barrett, and D. Syme. An industrially effective environment for formal hardware
verification. Trans. Comp.-Aided Des. Integ. Cir. Sys., 24(9):1381–1405, Nov.
2006.
[63] O. Shtrichman. Tuning SAT checkers for bounded model checking. In
Proceedings of the 12th International Conference on Computer Aided Verification,
pages 480–494, London, UK, 2000. Springer-Verlag.
[64] J. P. M. Silva. Practical applications of Boolean satisfiability. In Proceedings
of the 9th International Workshop on Discrete Event Systems, pages 74–80.
IEEE, 2008.
[65] J. P. M. Silva and K. A. Sakallah. Grasp: A search algorithm for propositional
satisfiability. IEEE Trans. Computers, 48(5):506–521, 1999.
[66] N. Sinha. Symbolic program analysis using term rewriting and generalization.
In Proceedings of the 2008 International Conference on Formal Methods in
Computer-Aided Design, FMCAD '08, pages 19:1–19:9, Piscataway, NJ, USA,
2008. IEEE Press.
[67] M. Stepp, R. Tate, and S. Lerner. Equality-based translation validator for
LLVM. In Proceedings of the 23rd international conference on Computer aided
verification, pages 737–742, Berlin, Heidelberg, 2011. Springer-Verlag.
[68] G. S. Tseitin. On the complexity of derivation in propositional calculus.
Studies in Constructive Mathematics and Mathematical Logic, Part II:178–188,
1968.
[69] M. N. Velev and R. E. Bryant. Verification of pipelined microprocessors by
correspondence checking in symbolic ternary simulation. In Proceedings of
the 1998 International Conference on Application of Concurrency to System
Design, CSD '98, pages 200–, Washington, DC, USA, 1998. IEEE Computer
Society.
[70] K. Wakabayashi and H. Tanaka. Global scheduling independent of control
dependencies based on condition vectors. In Proceedings of the 29th ACM/IEEE
Design Automation Conference, DAC '92, pages 112–115, Los Alamitos, CA,
USA, 1992. IEEE Computer Society Press.
[71] D. J. Wheeler and R. M. Needham. TEA, a tiny encryption algorithm. In
Fast Software Encryption, pages 363–366, 1994.
[72] Xilinx. AutoESL Reference Manual, 2011.
[73] J. Yang and C.-J. H. Seger. Introduction to generalized symbolic trajectory
evaluation. IEEE Trans. Very Large Scale Integr. Syst., 11(3):345–353, June
2003.
[74] J. Zhao, S. Nagarakatte, M. M. Martin, and S. Zdancewic. Formalizing the
LLVM intermediate representation for verified program transformations. In
Proceedings of the 39th annual ACM SIGPLAN-SIGACT symposium on
Principles of programming languages, pages 427–440, New York, NY, USA, 2012.
ACM.
