Characterization of asynchronous templates for integration into clocked CAD flows by Stevens, Kenneth & Xu, Yang
2009 15th IEEE Symposium on Asynchronous Circuits and Systems
Characterization of Asynchronous Templates
for Integration into Clocked CAD Flows
Kenneth S. Stevens, Yang Xu, and Vikas Vij 
Electrical and Computer Engineering
University of Utah
Abstract—Asynchronous circuit design can result in substantial 
benefits of reduced power, improved performance, and high 
modularity. However, asynchronous design styles are largely 
incompatible with clocked CAD, which has prevented wide-scale 
adoption. The key incompatibility is timing. Thus most commercial
work relies on custom CAD or untimed delay-insensitive
design methodologies. This paper proposes a new methodology, 
based on formal verification and relative timing, to create and 
prove correct necessary constraints to support asynchronous 
design with traditional clocked CAD. These constraints support 
timing driven synthesis, place and route, and behavior and timing 
validation of fully asynchronous designs using traditional clocked 
CAD flows. This flow is demonstrated through a simple example 
pipeline in IBM’s 65nm process showing the ability to retarget 
the design for improved power and performance.
I. Introduction
Two factors have driven a major shift in the semiconductor
industry as a result of the ever decreasing feature size of deep
submicron technology. First, power has emerged as a primary
metric for all designs, whether they are hand held devices
or desktop machines. Second, the exponential increase in the
number and performance of transistors on our chips has grown
to the point where modularity and design reuse are mandatory,
and efficient global synchronous clocking throughout the chip
is expensive in terms of power and design time.
Modular design blocks are easier to integrate, and can
be more power efficient if they operate at variable or
local optimums using independent frequencies. Current trends
clearly favor asynchronous design: networks of heterogeneous
cores that are locally optimized for power and cycle time.
Due to these factors the International Technology Roadmap
for Semiconductors predicts that 20% of designs will be
driven by handshake clocking in 2012, rising to 40% by 2020
[16]. Example designs that employ such methods have shown
substantial improvements in power, performance, and latency
[23], [18].
Handshake clocking relies on asynchronous controllers to
sequence a traditional “clocked” data pipeline. The formal
handshake protocols provide the requisite flexibility in frequency
and simplicity of modular interfacing. Unfortunately,
integrating handshake clocking with traditional clocked data
pipelines has proven problematic [10], [21]. In practice, the
radical and disruptive paradigm shift to fully asynchronous
design has been unsuccessfully attempted for years. General
adoption as predicted in the ITRS will be unlikely to occur
without a new approach that supports traditional CAD flows
and can be used by designers trained in clocked methodologies.
We view the difference in timing methodologies as the
primary impediment to exploiting traditional clocked CAD and
implementing handshake clocked designs.
This paper reports on a methodology, based on formal
verification and relative timing, that supports timing driven
synthesis, physical design, and pre- and post-layout timing
validation of handshake clocked designs using traditional
clocked CAD. This approach enables the general adoption of
asynchronous or “handshake clocked” circuits in the traditional
clocked flow. This new flow consists of fully characterizing the
asynchronous handshake clocking circuits as design templates
that replace the clock tree in a traditional clock design.
II. Background
A. Related Work
The path to general adoption of a disruptive technology
such as asynchronous circuit design is fraught with difficulty
and challenge. One of the primary roadblocks is the CAD
flow [21], [10]. This poses three problems for asynchronous
design. First, clocked CAD flows are in general incompatible
with sequential asynchronous design. Second, clocked CAD
tools are in general more capable than their asynchronous
cousins. Third, there is a general level of distrust in the ability
to correctly and robustly design commercial asynchronous
circuits. The ability to adapt clocked CAD and design flows
to asynchronous design, and to base asynchronous designs on
formal proofs of correctness, are enabling approaches that can
greatly ease the adoption of asynchronous circuits by the
general design community.
Recent research in the asynchronous community has begun
to achieve more industrial acceptance and broader use of
asynchronous designs by focusing on addressing the CAD
challenge. This has been achieved by integrating and adopting
clocked CAD where advantageous. The goals of this work are
no different. However, this work stands out from the rest on
two primary fronts. First, the methods used in this approach are
completely general to any asynchronous design, and apply
to bundled data as well as delay-insensitive designs; to any
protocol, be it two or four phase, dual rail, or single track; and
to any design flow, including a desynchronization approach or
full custom asynchronous design. Second, this approach is not
beholden to a programming language; we assume adoption of
today's de facto standard of Verilog.
1522-8681/09 $25.00 © 2009 IEEE
DOI 10.1109/ASYNC.2009.26
Related work that is probably the most technologically
advanced and commercially successful comes from Handshake
Solutions [12]. A complete synthesis, layout, and sign-off
solution for both Cadence and Synopsys based design flows
has been developed [6], and includes support for automatic
test generation [22]. Constraints and scripts from a higher
level language are generated that are supported by the clocked
CAD. Unfortunately there is little public documentation on
the algorithms or design methodology used to generate the
constraints and implement this flow. The flow is also tightly
coupled with their proprietary programming language. Another
commercial tool flow based on clocked CAD is from Theseus
Logic [5], [9]. This flow supports Verilog design descriptions
and translates the design to quasi delay-insensitive null convention
logic [8]. The Theseus flow does not directly support
bundled-data or other asynchronous methodologies.
Desynchronization is another design approach that uses
clocked CAD (as well as starting with a clocked design) to
produce asynchronous circuits [4], [1]. There are a number
of current research and industrial efforts focusing on this
promising flow. Desynchronization supports Verilog and uses a
template based approach, and algorithms have been developed
for test generation [14]. However, this flow does not support
general asynchronous design, largely due to the low number
of asynchronous templates and custom tools.
There are several other related research efforts to utilize
clocked CAD. In one example, an automated method was
developed for interconnect networks, but it does not support
matching delays and bundled data [13].
B. General Applicability
A key difference between the approach presented here and
other work is the generality of the solutions. This work
supports designs from clocked, to standard asynchronous
protocols, to pulse based [15], to wave pipelining designs [24].
This technology enables asynchronous designs to be specified
in industry standard representations such as Verilog, supports
synthesis with ASIC tools such as Design Compiler, uses
timing driven place and route tools such as IC Compiler or
Magma, and can be validated for correctness using Calibre and
PrimeTime. This new method when mature will not require
deep expertise in asynchronous theory or circuit design skills.
Desynchronization is an example of an approach to develop
handshake clocked designs [4], [1]. The method presented in
this paper supports desynchronization but is not limited to such
a methodology; indeed it can be applied to any asynchronous
design.
The key to the generality is the formal approach. Formal
verification (FV) is orthogonal to any particular synthesis
engine or design style. Thus all timing methodologies, from
clocked to delay insensitive (DI), are supported. Hence this
methodology frees designers from the constraints of any asynchronous
design style (e.g. DI) or custom design tool flows.
The verification utilizes relative timing (RT), which also supports
all classes of timing, from clocked to fully DI [20]. This
is implemented as follows. Each sequential template (clocked
or asynchronous) starts with a formal specification. The timing
constraints that must hold, be they quasi delay-insensitive forks
or matched bundled data delays, are all formally derived as
relative timing constraints. These RT constraints are proven
correct for the system behavior by the specification. The RT
constraints are then mapped to constraints that the clocked
CAD can use for timing driven design optimizations (typically
sdc constraints). This results in a design that is completely
general and provably correct if all the constraints hold.
Fig. 1. Formal Relative Timing Generation and Mapping to static timing
analysis (STA) Tools
The proposed asynchronous design flow is similar to the
traditional clocked design flow. Clocked design has focused its
design methodology around a single characterized sequential
circuit: the flip-flop. This work extends design to directly
support any sequential or asynchronous module in the design
flow. The asynchronous design modules, such as the flop,
will be embodied in circuit templates that have been fully
characterized with FV and relative timing, and can support
handshake clocking protocols as well as global clocking. These
templates are then used in the design.
While relative timing is the foundation of this approach,
giving it formal robustness and flexibility, other algorithms
are necessary to completely automate this flow. Algorithms to
support the templates in ASIC CAD include cycle breaking
to produce timing graphs that are DAGs, synthesis directives
to ensure the hazard properties of the templates are not
modified, and conversion of the template timing constraints
into sdc format for support by ASIC tools. Templates will be
designed that support the conversion of clocked design into
asynchronous “handshake clocking”. This full flow will be
presented through the simple example design.
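The cycle-breaking step mentioned above can be illustrated with a small sketch. It is not the algorithm used by any commercial tool or by the authors; it is a generic depth-first search that cuts every back edge, which is one standard way to turn a cyclic timing graph into a DAG. The function name `break_cycles` and the arc list are hypothetical, with net names loosely borrowed from the linear controller.

```python
def break_cycles(arcs, roots=()):
    """Split timing arcs into (kept, removed) so that the kept arcs form a DAG.

    arcs  : list of (source, destination) pairs describing the timing graph
    roots : nodes to start the search from (e.g. primary inputs); assumed names
    """
    graph = {}
    for src, dst in arcs:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on the DFS stack, finished
    colour = dict.fromkeys(graph, WHITE)
    removed = []

    def visit(node):
        colour[node] = GREY
        for nxt in graph[node]:
            if colour[nxt] == GREY:       # back edge: this arc closes a cycle
                removed.append((node, nxt))
            elif colour[nxt] == WHITE:
                visit(nxt)
        colour[node] = BLACK

    # Search from the requested roots first, then from anything left over.
    for start in list(roots) + sorted(graph):
        if colour.get(start) == WHITE:
            visit(start)

    kept = [arc for arc in arcs if arc not in removed]
    return kept, removed


# Hypothetical feedback loop: la holds itself and is also reset through y_.
LOOP = [("lr", "la_"), ("la_", "la"), ("la", "la_"), ("la", "y_"), ("y_", "la_")]
kept, removed = break_cycles(LOOP, roots=["lr"])
```

Starting the search at the request input means the arcs that get cut are the feedback arcs, so the forward path from `lr` stays in the timing graph. In a real flow each removed arc would correspond to a pin-to-pin arc inside a gate.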
C. Formal Timing and Verification
Relative timing can accurately capture, model, and validate
the relationship between heterogeneous timing and behavior
in protocols and general circuit structures, including sequential
asynchronous designs. First, timing constraints are made
explicit in designs, rather than use the traditional implicit representations
such as a clock frequency. This allows designers
and tools to specify, understand implications, and manipulate
the timing of far more general circuit structures and advanced




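A binary relation $LC \subseteq P \times P$ over agents is a logic conformation
between implementation $I$ and specification $S$ if
$(I, S) \in LC$ then $\forall \alpha \in Act$ and $\forall \beta \in \overline{A} \cup \{\tau\}$ (outputs
and $\tau$) and $\forall \gamma \in A$ (inputs)
(i) Whenever $S \xrightarrow{\alpha} S'$ then
$\exists I'$ such that $I \stackrel{\alpha}{\Longrightarrow} I'$ and $(I', S') \in LC$
(ii) Whenever $I \xrightarrow{\beta} I'$ then
$\exists S'$ such that $S \stackrel{\beta}{\Longrightarrow} S'$ and $(I', S') \in LC$
(iii) Whenever $I \xrightarrow{\gamma} I'$ and $S \stackrel{\gamma}{\Longrightarrow}$ then
$\exists S'$ such that $S \stackrel{\gamma}{\Longrightarrow} S'$ and $(I', S') \in LC$
Fig. 2. Bisimilar Logic Conformance Relationship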
the performance and correctness of a circuit are transformed
into logical constraints, rather than into real-valued variables
or delay ranges. A compact representation has been developed
using point-of-divergence (POD) to point-of-convergence
(POC) constraints. The POD/POC ($pod \mapsto poc_0 \prec poc_1$)
representation enables more efficient search and verification
algorithms to be developed, which greatly enhances the ability
to combine timing with optimization, physical placement, and
validation design tools [17]. This approach alters the way
in which timing is represented by designers and CAD tools,
and has been shown to provide significant power-performance
advantages in some circuit designs [18], [20].
Formal verification and relative timing is the key technology
that permits templates to be characterized in a way that is
compatible with clocked CAD. The formal verification uses
model checking. The representation and method of generating
RT constraints is shown in Fig. 1. This work applies a
conformance relation between the specification (spec.) and
implementation (design) based on the bisimulation conformance
relation shown in Fig. 2. The formal verification tool
(RT-FV) proves correctness of an implementation against a
specification. Timing constraints are represented as logical
expressions that make error states unreachable. A set of
constraints can be automatically generated that restrict the
timing of the implementation such that it conforms to the
specification [7]. Now timing is fully represented in the logical
and behavioral domain. The constraints are then mapped
to a format acceptable by a static timing analysis (STA)
engine, synthesis engine, or place and route engine, such as
PrimeTime, Design Compiler, or SoC Encounter.
D. Template Based Methodology
Rather than compete in the CAD domain and develop fully
independent flows, one can apply commercial clocked CAD
and its associated algorithms as broadly as possible, and
restrict custom tools to the necessary asynchronous circuit
and verification problems. This approach, unlike purely asynchronous
design, is able to leverage the significant industrial
investment in synchronous design tools. Such a flow is supported
in this paper based on “design templates” which are
the asynchronous sequential components of a design. If this
approach is successful and adopted by industry, designers will
be able to build asynchronous systems based on the merits of
the architecture, such as performance and power.
This approach to asynchronous design with clocked tools
thus has two facets: (a) the design and characterization of the
asynchronous templates, and (b) traditional system design that
employs the asynchronous templates. The key to making this
happen is to develop characterized templates that can be manipulated
and optimized when inserted into the clocked CAD
flows. The design and characterization of the asynchronous
templates requires substantial expertise in asynchronous design
and verification. However, once the templates have been
completed, they can be inserted into a design flow by clocked
designers with little expertise in asynchronous design. Thus
the bulk of the asynchronous circuit and CAD work is restricted to
off-line library design and characterization.
III. Formal Characterization Flow
The characterization of an asynchronous template is somewhat
complicated, and will be demonstrated on the design
of a linear pipeline controller LC. This template is part of a
simple design example shown in Fig. 3 that will be used in
the remainder of this paper. There are only two asynchronous
templates in this design, the linear controller (LC) and the Fork
Join template (F/J). The rest of the design is synthesized using
normal clocked tool flows. We have designed a small microprocessor
using this flow, and this example is a conceptual
piece of such a design that calculates the function $x^2 + 3x$.
A. Bundled Data with Clocked Datapath
Bundled data asynchronous designs are partitioned into two
signal classes: the datapath and control. The datapath in Fig. 3
consists of the registers (R) and oval boxes implementing
arithmetic functions. The registers are implemented as either
latches or flip-flops. This datapath is synthesized using Design
Compiler based on frequency parameters provided by the
user. The rest of the design is the control logic, which is
implemented by the clock distribution logic in a clocked design.
To create a “handshake clocked” design, the global clock is
replaced with the control logic. In this case there are four
instantiations of the linear controller (LC) and two of the fork
join (F/J) module.
The responsibility of the handshake clocking is to maintain
the timing and functional relationship between data in adjacent
pipeline stages, implementing stalls when necessary. This is
achieved by implementing a handshake protocol in the LC
blocks. Extra delay may be needed between the control blocks
so that the clock signal does not arrive at the flop before the
input data is valid. Hence a matched delay will be implemented
between the data banks $i$ and $i+1$ on the control path. For
example, the delay from r0 to r00 must match the $x^2$ datapath
from R0_q to the input of R10.
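As a rough illustration of the matched-delay requirement, the sketch below estimates how many delay elements would have to be added to a control path so that it is no faster than the datapath it bundles. This is a back-of-the-envelope model, not the synthesis procedure of the flow; the function name, the per-element delay, and the 0.5 ns control delay are assumptions, while the 1.7 ns datapath figure is the one used later in the example constraints.

```python
import math


def delay_elements_needed(datapath_max_ns, control_min_ns, element_delay_ns, margin=0.0):
    """Number of delay elements to add so the control path covers the datapath.

    datapath_max_ns  : worst-case datapath delay between register banks
    control_min_ns   : best-case delay of the unpadded control path
    element_delay_ns : minimum delay contributed by one added element (assumed)
    margin           : fractional safety margin on the datapath delay (assumed)
    """
    if element_delay_ns <= 0:
        raise ValueError("element delay must be positive")
    target = datapath_max_ns * (1.0 + margin)
    shortfall = target - control_min_ns
    if shortfall <= 0:
        return 0                      # control path is already slow enough
    return math.ceil(shortfall / element_delay_ns)


# 1.7 ns datapath, hypothetical 0.5 ns control path, 0.25 ns per delay element.
n_plain = delay_elements_needed(1.7, 0.5, 0.25)
n_margin = delay_elements_needed(1.7, 0.5, 0.25, margin=0.1)
```

The comparison uses the maximum datapath delay against the minimum control delay, which is the pessimistic pairing a bundled-data constraint has to satisfy.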
B. Asynchronous Template Design
Numerous handshake protocols and asynchronous circuit
designs are feasible realizations for linear pipelines. The
protocol and circuit design for each template will have a large
impact on the design in three ways. First, the templates directly
impact the performance and power based on the complexity
of the design and the concurrency of the protocol. Second,
the characterization of the template critically depends on the
protocol and implementation. Finally, the correctness of the
system, particularly with cyclic pipelines, will depend on
the protocols and storage elements employed [2]. Hence our
method supports all templates.
Fig. 3. Example design: a simple ASIC mathematical pipeline segment computing $dout = x^2 + 3x$
Fig. 4. LC circuit implementation
LEFT = lr.c1.la.c2.lr.la.LEFT
RIGHT = c1.rr.c2.ra.rr.ra.RIGHT
SPEC = (LEFT | RIGHT) \ {c1, c2}
Fig. 5. CCS specification of linear controller
The design used for the linear controller in this example
is shown in Fig. 4. This implements the four-cycle return to
zero handshake protocol shown in Fig. 5 and 6 as CCS and
Petri-Net specifications [11], [3]. Our CAD tools support both
representations. Note that this is a timed protocol (the dashed
arcs in Fig. 6 constrain inputs), similar to a burst-mode specification.
Such a protocol is chosen for this example because
it illustrates the requirement of additional fundamental mode
timing constraints to guarantee correct implementation in a
design, as compared to delay-insensitive or speed-independent
designs. The result of mapping this design to a Verilog module
in the Artisan 65nm IBM 10sf library is shown in Fig. 7.
C. Clocked CAD Tool Constraints
Following are the sdc constraints supported by commercial
tools that are used in this asynchronous template characterization:

6) set_disable_timing
Structural modifications to a design may occur during
synthesis and place and route flows. These changes result
in optimizations such as removing back-to-back inverters,
combining simple gates into a single complex gate, or breaking
a complex gate into a set of simpler gates. Constraints are used
to prevent this from occurring in the asynchronous blocks,
because it could result in hazards or substantially modify
necessary delay properties of the circuit. The set_size_only
constraint prevents the tool from structurally modifying the
cell but allows the tools to optimize the drive strength of
the cell for power and delay optimization. The set_dont_touch
constraint disallows the tool from modifying the cell in any
manner. These commands take as arguments the cell instance
names. The following command disallows structural modification
of all lc3 instances (the AOI gate) in the example design.
set_size_only -all_instances { */lc3 }
module linear_control (lr, la, rr, ra, ck, rst);
  input lr, ra, rst;
  output la, rr, ck;
  INVX1A12TH lc0 (.A(ra), .Y(ra_));
  AOI32X1A12TH lc1 (.A0(lr), .A1(ra_), .A2(y_), .B0(lr), .B1(la), .Y(la_));
  INVX1A12TH lc2 (.A(la_), .Y(la));
  AOI32X1A12TH lc3 (.A0(ra_), .A1(lr), .A2(y_), .B0(ra_), .B1(rr), .Y(rr_));
  NOR2X1A12TH lc4 (.A(rr_), .B(rst), .Y(rr));
  c_element lc5 (.A(la), .B(rr), .Y(y_));
  INVX1A12TH lc6 (.A(la_), .Y(ck));
endmodule // linear_control
Fig. 7. Verilog implementation in the 65nm Artisan library
set_size_only -all_instances { */lc1 }
set_size_only -all_instances { */lc3 }
set_size_only -all_instances { */lc4 }
set_size_only -all_instances { */lc5 }
Fig. 8. Size only constraints for the circuit of Fig. 7
Traditionally the tools use clock domains to optimize circuits
for power and performance. They understand setup and
hold constraints into flops and latches. When the sequentials
are driven from a simple clock domain the tools can optimize
the combinational logic for the desired frequency. All of
these tools operate on directed acyclic graphs, or DAGs. If
the timing graphs have cycles, algorithms in the CAD tools
are called to break the cycles. A user can manually define
how to break the timing graphs with the set_disable_timing
constraint. This will remove the timing arc from a primitive gate
(such as a NAND gate) from the specified input to the specified
output. By removing the timing arcs in the primitive gates,
a manual instance of the timing graph, and how signals
propagate through the circuit, can be defined. This is essential
for sequential circuits that use handshaking since they always
consist of cyclical timing paths. This command takes a -from
pin name, a -to pin name and a list of cells. The following sdc
command disables the timing arc from y_ to rr_ through one
of the AOI gates in all instantiations of the linear controller
in the example design.
set_disable_timing -from A2 -to Y \

By default, the maximum and minimum path delays are
calculated by considering the clock edge times. Extensions
to this flow have been implemented in the tools to override
timing values, support asynchronous signaling, and timing
domains that are not part of a fixed clock domain. These
are the set_max_delay, set_min_delay, and set_data_check
commands.
One can override the timing constraints in a clocked domain
with a specific time value by using the set_max_delay or
set_min_delay command. This command has the side effect
of breaking the timing graph at the two end points of the
constraint (similar to the set_disable_timing constraint). This
command has several options, but basically takes a -from set
of path start points, a -through set of points the path must pass
through, and a -to set of path end points, and the target delay
value. Relative timing constraints can be checked using a pair
of commands as follows.
set_max_delay 1.7 -from [get_pins R0_reg_latch*/Q] \
    -to [get_pins R10_reg_latch*/D]
set_min_delay 1.7 -rise_from [get_clocks tk0/lr] \
    -rise_to [get_pins tk10_lc1/A0]
The first constraint will make all the paths from the output of
register R0 to the input of register R10 have a maximum delay
of 1.7ns in our example design. The second will constrain the
minimum delay path on the control path to also be 1.7ns.
This path is from the lr input of the controller associated with
register R0 to the input of the linear controller that clocks
register R10.
Fig. 9. The set_data_check command
The set_data_check command is used to check setup or hold
constraints between two unclocked data signals. The -from
signal is considered to be the “clock” signal (called the related
signal) and the -to signal is considered to be the data signal
(called the constrained signal). This performs a setup check and can
be given a margin. This is clarified in Fig. 9. Given a relative
timing constraint, the relative ordering of two signals can be
mapped into -from and -to constraints with a slack time. The
common point of divergence can be given with the -clock
option, as shown:
set_data_check -clock [get_clocks tk0/lr] \
    -fall_from [get_pins tk0_lc3/A2] \
    -rise_to [get_pins tk0_lc3/B1] -setup 0.05
This example implements the constraint $lr\uparrow \mapsto rr\uparrow \prec lr\downarrow$
where $lr\uparrow$ is the POD specified by -clock, $rr\uparrow$ is the -rise_to
signal, and $lr\downarrow$ is the -fall_from signal. This command correctly
checks the maximum delay for the constrained -rise_to signal
against the minimum delay for the related -fall_from signal,
with a margin of 50ps.
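The check described above (maximum delay to the constrained signal against minimum delay to the related signal, with a margin) can be written down as a short sketch. This is only an illustration of the inequality, not of how an STA engine computes path delays; the class name, the `lr+`/`rr+`/`lr-` event labels, and the delay values are assumptions, while the 50 ps margin is taken from the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RTConstraint:
    """Relative timing constraint pod -> poc0 < poc1 with an optional margin (ns)."""
    pod: str
    poc0: str          # event that must arrive first (the constrained signal)
    poc1: str          # event that must arrive later (the related signal)
    margin: float = 0.0


def slack(c, max_delay, min_delay):
    """Slack in ns; max_delay/min_delay map (pod, event) pairs to path delays."""
    return min_delay[(c.pod, c.poc1)] - max_delay[(c.pod, c.poc0)] - c.margin


def holds(c, max_delay, min_delay):
    """True when the slowest early path still beats the fastest late path."""
    return slack(c, max_delay, min_delay) > 0


# Hypothetical delays for the constraint lr+ -> rr+ < lr- with a 50 ps margin.
constraint = RTConstraint(pod="lr+", poc0="rr+", poc1="lr-", margin=0.05)
max_delay = {("lr+", "rr+"): 0.30}
min_delay = {("lr+", "lr-"): 0.42}
result_slack = slack(constraint, max_delay, min_delay)
result_ok = holds(constraint, max_delay, min_delay)
```

Keeping the point of divergence in the key of both delay tables mirrors the role of the -clock option: both paths are measured from the same launching event.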
The combination of constraints allows us to utilize the
synthesis, place and route, and timing tools to optimize and
validate the timing of asynchronous designs.
D. Template Characterization
This section describes the detailed flow required to
characterize the LC pipeline template.
1) Model Generation: The first step in template characterization
is converting the Verilog module (Fig. 7) into an
equivalent formal representation for verification by model
checking. This transformation is automated to aid in correctness
and productivity. The CAD tool takes three inputs: (i)
the Verilog design of the template, (ii) a mapping of Verilog
gates to formal semi-modular description of each gate in CCS,
and (iii) a functional description of the gates in the target
technology (Fig. 10). This code assigns the inputs of the
module to boolean values (0 for lr and ra, 1 for rst) and
simulates the design to calculate the initial voltages for each
node in the design. The node values are used to select the
correct initial state for each formal CCS module. CCS has
been selected for verification because it formally supports
verification of nondeterminism (arbiters and synchronizers)
through the semantics of the internal $\tau$ transitions, giving
additional applicability of the flows.
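The initial-state calculation can be sketched as a simple settle-to-fixpoint simulation of the netlist. The gate list below follows the pin connections as they read in Fig. 7, and the inputs are driven as described (0 for lr and ra, 1 for rst). Several points are assumptions rather than facts from the source: the `c_element` cell is modelled as an inverting Muller C-element, unknown internal nodes start at 0, and the helper names are invented for the example.

```python
def inv(prev, a):
    return 1 - a


def nor2(prev, a, b):
    return int(not (a or b))


def aoi32(prev, a0, a1, a2, b0, b1):
    # AND-OR-INVERT: a 3-input AND and a 2-input AND feeding a NOR.
    return int(not ((a0 and a1 and a2) or (b0 and b1)))


def c_element_inv(prev, a, b):
    # Assumed inverting C-element: output changes only when inputs agree.
    return 1 - a if a == b else prev


# (output net, gate function, input nets), as read from the Fig. 7 netlist.
LC_GATES = [
    ("ra_", inv, ["ra"]),
    ("la_", aoi32, ["lr", "ra_", "y_", "lr", "la"]),
    ("la", inv, ["la_"]),
    ("rr_", aoi32, ["ra_", "lr", "y_", "ra_", "rr"]),
    ("rr", nor2, ["rr_", "rst"]),
    ("y_", c_element_inv, ["la", "rr"]),
    ("ck", inv, ["la_"]),
]


def settle(gates, inputs, max_passes=50):
    """Re-evaluate every gate until no net changes; return all net values."""
    values = {out: 0 for out, _, _ in gates}   # assumed starting guess
    values.update(inputs)
    for _ in range(max_passes):
        changed = False
        for out, func, ins in gates:
            new = func(values[out], *(values[i] for i in ins))
            if new != values[out]:
                values[out] = new
                changed = True
        if not changed:
            return values
    raise RuntimeError("netlist did not settle")


initial = settle(LC_GATES, {"lr": 0, "ra": 0, "rst": 1})
```

Each settled net value would then pick the starting state of the corresponding CCS gate model, for example the state of a NAND agent whose name encodes its input and output levels.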
The designer must then create a complete formal specification
of the behavior of the module. This is usually done
during the design and synthesis procedure. Fig. 5 and 6 show
two equivalent specifications for LC that our tools currently
support. This work does not use an assumes-guarantees model,
but rather one that fully specifies the input and output signal
behavior as can be seen with these specifications.
2) Verification and Constraint Generation: The implementation
is then verified against the specification using model
checking. The verification flow is also used to generate the
timing constraints for this design. An untimed semi-modular
model checking engine using the bisimulation based conformance
relation of Fig. 2 is employed [19]. The initial verification
employs speed independence semantics. This traditionally
will result in numerous violations, since almost every circuit
requires some timing assumptions, many due to technology
mapping. For LC seven errors occur. These violations must
be removed through relative timing constraints that reduce the
reachability graph of the implementation. Four local timing
constraints are sufficient to make the implementation conform
to the specification, including: $lr \mapsto y\_ \prec rr$ and
$lr \mapsto y\_ \prec la$. The first constraint requires that the cycle
in Fig. 4 from $lr$ to $y\_$ is faster than the cycle from $lr$
to $la$. Upon applying these RT constraints the
design verifies as conformant to the specification. This first
speed-independent verification run produces the key timing
constraints for timing driven sizing and place and route.
A second verification run is required to ensure that timing
constraints of the protocol are correctly generated. The protocol
in this example is a timed protocol. This protocol has
burst-mode properties where the outputs la and rr must both
occur before either of their related causal inputs lr and ra. A
pipeline of three controllers in series is verified to generate
the protocol constraints between modules. This results in
two additional fundamental mode RT constraints, such as the
constraint $lr\uparrow \mapsto rr\uparrow \prec lr\downarrow$. This requires that the rr signal be
driven high before the $lr\uparrow$ to $la\uparrow$ to $lr\downarrow$ cycle occurs. These are
also key constraints that must be enforced during the timing
driven sizing and place and route of the design.
A third hierarchical verification is run on template specifications
and the datapath to generate any timing constraints
between the handshake clocking and the datapath logic. When
synthesizing bundled data designs, these runs will create the
matched delay constraints between datapath and control. This
produces a number of constraints such as $lr \mapsto din \prec la$.
This ensures that the minimum relative delay through the
control path is larger than the maximum delay in the datapath.
These constraints are necessary to automatically synthesize the
matching delays necessary in the pipeline.
The design is finally verified under delay-insensitive conditions
where every wire segment outside of a native library
gate is given an unbounded delay. The DI model normally
generates a copious number of constraints. The fully DI LC
design adds 2,920,701 violations with 967,777 states. A set of
eleven more timing constraints remove 1,877 transitions and
reduce the design to 2,292 states which are conformant to the
original specification. This concludes the verification aspect of
template characterization.
3) RT Conversion to sdc Constraints: The RT constraints
from verification are then converted into two classes of sdc
constraints: set_data_check constraints and set_max_delay,
set_min_delay constraints. These constraints control timing
driven sizing, synthesis, and place and route of the design.
Clocked CAD tools do a marvelous job of timing driven design
when using the max and min delay constraints. However, these
constraints break the timing graphs at the end points of the
paths, and are somewhat particular about what can be used as
an end point. The data check constraints don't cut the timing
graphs and are not nearly as particular about the end points,
but cannot be relied upon to perform timing driven synthesis
(such as generating delay elements for min-delay constraints).
As such, a hybrid set of constraints is used to improve the
quality and run-time of the tools.
The verification runs between the specification and different
implementation models result in three sets of data check
constraints as shown in Fig. 12. The sdc constraints are
assumed to lie inside clock domains. The clock path must be
defined to be on the point-of-divergence in the RT constraints.
In this design the clock domains are propagated from the lr signal.
The sdc constraints are then mapped to paths that converge
on two pins of a single gate instance. For example, the first
sdc constraint of Fig. 12 came from
$lr\downarrow \mapsto y\_\downarrow \prec la\downarrow$. This
constraint thus ensures that the transitions on the A2 and B1 pins
of the AOI gate instance lc1 in Fig. 7 (which map to the signals
y_ and la) occur in the correct order.
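The mapping from an RT constraint to a data check can be sketched as a small generator. This is only an illustrative sketch, not the authors' tool: the `Event` class and `data_check` function are hypothetical names, and the pin patterns and `$race_margin` variable are copied from the first constraint of Fig. 12. The early and late events of the RT constraint become the "from" and "to" pins of the same gate instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A signal transition at a gate pin (hypothetical helper type)."""
    pin: str       # hierarchical pin pattern, e.g. "*/lc1/A2"
    rising: bool   # True for a rising edge, False for a falling edge


def data_check(early: Event, late: Event, setup: str) -> str:
    """Emit one set_data_check line ordering `early` before `late`.

    `setup` is passed through verbatim so that a Tcl variable such as
    $race_margin can be used, as in Fig. 12.
    """
    def edge(event: Event, side: str) -> str:
        direction = "rise" if event.rising else "fall"
        return f"-{direction}_{side} {event.pin}"

    return f"set_data_check {edge(early, 'from')} {edge(late, 'to')} -setup {setup}"


# y_ falling (pin A2) must precede la falling (pin B1) on instance lc1.
line = data_check(Event("*/lc1/A2", False), Event("*/lc1/B1", False), "$race_margin")
print(line)
```

The point-of-divergence does not appear in the emitted line; in the flow described above it is captured by defining the clock path from the lr signal.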
CCS specification functional descriptions:
function NAND0001 4 d not(a * b * c)
function NOR001 3 c not ( a + b )
function A2B1O2I0001 7 d not((not(a)*b) + c)
function O12A2I0001 6 d not(a * (b + c))
Gate library to CCS specification mapping:
module artisan65nm2ccs ();
NAND3X2A12TR NAND0001 (.A(a), .B(b), .C(c), .Y(d));
NOR2X2A12TR NOR001 (.A(a), .B(b), .Y(c));
AOI2XB1X2A12TR A2B1O2I0001 (.A0(b), .A1N(a), .B0(c), .Y(d));
OAI21X2A12TR O12A2I0001 (.A0(b), .A1(c), .B0(a), .Y(d)); 
endmodule // artisan65nm2ccs
Fig. 10. Snippets of the functional cell representation and Verilog to CCS specification mapping. The second and third columns in the functional description 
define the start of signal voltage state section of gate name, and the name of the output. The cell to spec mapping is a Verilog module that maps the design 
(artisan cell) to an instance (the CCS specification).
agent NAND001 = a.NANDa01 + b.NAND0b1;
agent NANDa01 = a.NAND001 + b.NANDab1;
agent NAND0b1 = a.NANDab1 + b.NAND001;
agent NANDab1 = 'c.NANDab0;
agent NANDab0 = a.NAND0b0 + b.NANDa00;
agent NAND0b0 = b.NAND000 + 'c.NAND0b1;
agent NANDa00 = a.NAND000 + 'c.NANDa01;
agent NAND000 = a.NANDa00 + b.NAND0b0 + 'c.NAND001;
Fig. 11. The semi-modular specification of a 2-input NAND gate. Inputs
that would disable an output are not permitted. Such inputs create
semi-modular computation interference errors in the verification. The state
is mapped to the logic level of the inputs, written as 0 or the name of the
pin (e.g. {0,a}). The output is specified as its logic level.
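The semi-modular behavior of Fig. 11 can be illustrated by replaying a trace against the state table. The sketch below is an assumption-laden illustration rather than the verification engine used in the paper: the table is transcribed from the agent definitions, the CCS output action 'c is written simply as "c", and `run_trace` is a hypothetical helper that reports computation interference when an action is not permitted in the current state.

```python
# State names follow Fig. 11: NAND<a><b><c>, where an input position holds
# "0" when the pin is low or the pin name when it is high, and <c> is the
# logic level of the output.
NAND_SPEC = {
    "NAND001": {"a": "NANDa01", "b": "NAND0b1"},
    "NANDa01": {"a": "NAND001", "b": "NANDab1"},
    "NAND0b1": {"a": "NANDab1", "b": "NAND001"},
    "NANDab1": {"c": "NANDab0"},
    "NANDab0": {"a": "NAND0b0", "b": "NANDa00"},
    "NAND0b0": {"b": "NAND000", "c": "NAND0b1"},
    "NANDa00": {"a": "NAND000", "c": "NANDa01"},
    "NAND000": {"a": "NANDa00", "b": "NAND0b0", "c": "NAND001"},
}


def run_trace(trace, start="NAND001"):
    """Replay a list of pin toggles; stop at the first forbidden action."""
    state = start
    for step, action in enumerate(trace):
        next_state = NAND_SPEC[state].get(action)
        if next_state is None:
            # The action would disable an excited output (or is otherwise
            # not offered by the specification in this state).
            return ("interference", step, state)
        state = next_state
    return ("ok", len(trace), state)


print(run_trace(["a", "b", "c"]))  # both inputs rise, then the output falls
print(run_trace(["a", "b", "a"]))  # input a withdrawn before the output fires
```

In the second trace the output is excited to fall in state NANDab1, so toggling input a before the output fires is exactly the kind of input the specification forbids.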
The speed-independent verification constraints are key
constraints that must be optimized through the CAD tools for
timing driven place and route to ensure correct timing in
the design. For LC these constraints ensure that the timing
of the feedback for the local state variable through the
C-element holds. The next set relates to the verification of three
pipelined protocols that exposed the constraints due to the
timed protocol. These constraints do not need to be included
in the synthesis and place and route flows because of the
magnitude of the slack between the two race paths. The late
arriving path for these delays goes through multiple LC cells
and potentially delay elements, whereas the fast path is an
internal feedback in the LC cell. The final set of constraints
was generated from the verification between the specification
and the delay-insensitive implementation model. These wire
fork constraints are not normally used for synthesis, but are
validated post-layout.
The final set of constraints uses max and min delay
constraints, as illustrated in Fig. 14. These are derived from the
verification of the pipelined protocol with datapath models.
Each POD constraint is broken into a set of constraints: one
for the fast path and a pair of constraints for the slow path.
The minimum delay of the fast path through the datapath logic
is constrained with a max delay constraint equaling the cycle
time minus setup and hold times of the logic ($clk_period).
The slow clock path is constrained with a min-delay constraint,
which creates the delay element if necessary ($req_del_min).
To ensure a tight bound for this constraint, a max delay that
is slightly larger than the min delay ($req_del_max) is also
applied to this path. The constraint shown in this example
ensures that the data through the x^2 logic arrives before the
clock. Delay elements will be added in the control path.
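The three-constraint decomposition described above can be sketched as follows. This is a hedged illustration only: `protocol_constraints` is a hypothetical function, the numeric delays are made-up example values in nanoseconds, and both slow-path bounds are applied to the same pair of end points, following the text. The end point names are taken from Fig. 14.

```python
def protocol_constraints(data_from, data_to, ctrl_from, ctrl_to,
                         clk_period, req_del_min, req_del_max):
    """Return the sdc lines for one point-of-divergence constraint.

    One max delay bounds the datapath (fast path); a min/max pair bounds
    the control (slow) path so that any inserted delay element is tight.
    """
    if not req_del_min < req_del_max:
        raise ValueError("req_del_max must be slightly larger than req_del_min")
    return [
        f"set_max_delay {clk_period} -from {data_from} -to {data_to}",
        f"set_min_delay {req_del_min} -rise_from {ctrl_from} -rise_to {ctrl_to}",
        f"set_max_delay {req_del_max} -rise_from {ctrl_from} -rise_to {ctrl_to}",
    ]


# Example values (assumed): 2.0 ns datapath target, 2.0 to 2.1 ns control delay.
for line in protocol_constraints("R0_reg/q", "R1_reg/d", "tk0/lr", "tk10/lr",
                                 2.0, 2.0, 2.1):
    print(line)
```

Rejecting a max bound that is not above the min bound catches the case where the pair of slow-path constraints could never be met simultaneously.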
While only a portion of the constraints are used in the
synthesis flow, all are used for post-layout validation, including
the DI constraints. The correct application of the data check
constraints must be checked with report_timing commands as
shown in Fig. 13.
4) DAG Timing graph generation: The timing driven
synthesis and optimization algorithms in clocked CAD all work
on directed acyclic graphs (DAGs). Further, many of these
algorithms are restricted to paths defined as "clocks". Most
asynchronous templates are sequential designs with feedback,
which can be seen by examining Fig. 4. The handshake
protocols themselves produce cycles (Fig. 3). Important paths
through these cycles must be defined as clocks and broken
into DAGs without breaking essential timing paths.
Loop breaking algorithms exist in the clocked CAD.
However, the commercial software cuts the cycles in such a way
that many of the necessary timing paths are broken. This
results in constraints that cannot be applied to the design,
poor sizing and power, and potential failures in the design.
Integrating the generation of correct DAGs through cycle
cutting in the implementation is therefore an essential part of
the library characterization. To ensure that all of the constraints
are correctly applied to the design, a report_timing command
should be added for every constraint as shown in Fig. 13.
The loop cutting constraints for LC are shown in Fig. 15.
New graph cutting algorithms need to be developed to
automatically define "clock" paths the algorithms can trace,
and ensure that all the constraints can be applied in the
synthesis and validation runs. This approach would ensure the
point-of-divergence of the RT constraints and all subsequent
paths to the points-of-convergence are not broken. Even with
optimal algorithms a single set of cuts might not be possible,
and multiple tool runs may be necessary.
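The requirement on a set of cuts can be stated as two checks: the remaining timing graph must be acyclic, and no constrained path may contain a cut arc. The sketch below illustrates these checks on a small hypothetical graph; the node names, the arcs, and the functions `is_acyclic` and `check_cuts` are assumptions for illustration and do not reproduce the timing graph of the LC cell or any algorithm from the paper.

```python
from collections import defaultdict, deque


def is_acyclic(edges):
    """Kahn's algorithm: True if the directed graph has no cycle."""
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    return visited == len(nodes)


def check_cuts(edges, cuts, protected_paths):
    """Return (graph is a DAG after cutting, list of constrained paths broken)."""
    remaining = [e for e in edges if e not in cuts]
    broken = [p for p in protected_paths
              if any(arc in cuts for arc in zip(p, p[1:]))]
    return is_acyclic(remaining), broken


# Hypothetical graph with a local feedback loop and a handshake loop.
EDGES = [("lr", "y_"), ("y_", "rr"), ("rr", "y_"), ("y_", "la"), ("la", "lr")]
PROTECTED = [("lr", "y_", "la")]  # path from a point-of-divergence to a pin

print(check_cuts(EDGES, {("rr", "y_"), ("la", "lr")}, PROTECTED))
print(check_cuts(EDGES, {("lr", "y_"), ("rr", "y_")}, PROTECTED))
```

Both cut sets produce a DAG, but only the first leaves the constrained path intact, which is the distinction between an acceptable cut set and one that silently drops a constraint.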
5) Protecting Design Fidelity: A final set of constraints is
necessary to ensure that the characterization process remains
valid through the tool flows. Many parts of the flow, including
the synthesis and place and route tools, can optimize the logic
by remapping gates. While this in general can improve the
design, modifications to sequential asynchronous controllers
produce results that at best don't match the verification results,
and at worst produce non-functional logic due to hazards.
Applying the size_only property (Fig. 8) to all logic gates
ensures that they will not be logically modified through the
tool flows. This constraint still allows the gates to be optimally
sized in the timing driven power and performance optimization
algorithms.
speed-independent design constraints:
set_data_check -fall_from */lc1/A2 -fall_to */lc1/B1 -setup $race_margin
set_data_check -fall_from */lc3/A2 -fall_to */lc3/B1 -setup $race_margin
external protocol constraints:
set_data_check -fall_from */lc1/A1 -rise_to */lc1/B1 -setup 0
set_data_check -fall_from */lc3/A1 -rise_to */lc3/B1 -setup 0
set_data_check -fall_from */lc5/A -rise_to */lc5/Y -setup 0
set_data_check -fall_from */lc5/B -rise_to */lc5/Y -setup 0
wire fork constraints:
set_data_check -rise_from */lc3/A2 -fall_to */lc3/A1 -setup 0
set_data_check -rise_from */lc1/A2 -fall_to */lc1/A1 -setup 0
Fig. 12. Timing constraints of implementation of Fig. 7
Fig. 13. Report statements to validate the timing constraints in Fig. 12
Latch timing constraints:
set_max_delay $clk_period -from R0_reg/q -to R1_reg/d
set_min_delay $req_del_min -rise_from tk0/lr -rise_to tk10/lr
set_max_delay $req_del_max -rise_from tk0/lr -rise_to tk10/rr
Fig. 14. Protocol level constraints for the linear control template
breaking local cycles:
set_disable_timing -from A2 -to Y [find -hier cell *lc1]
set_disable_timing -from B1 -to Y [find -hier cell *lc1]
set_disable_timing -from A2 -to Y [find -hier cell *lc3]
set_disable_timing -from B1 -to Y [find -hier cell *lc3]
breaking handshake protocol cycles:
set_disable_timing -from A1 -to Y [find -hier cell *lc1]
set_disable_timing -from A1 -to Y [find -hier cell *lc3]
set_disable_timing -from B0 -to Y [find -hier cell *lc3]
Fig. 15. Loop breaking constraints
IV. Design Examples
Fig. 3 shows a datapath used to illustrate synthesis, place
and route, and post-layout validation. The Verilog used to
synthesize this pipeline is shown in Fig. 16. In general, our
approach imposes the following requirements on an
implementation:
1) Only fully characterized templates can be used in the
control path.
2) All paths in the handshake clocking must be point-to-point
between characterized template modules.
3) Network liveness requires complementary template pairs
that implement dual data steering fan-out and fan-in
operations.
Many templates implement the complementary or dual
operation through a simple structural mirroring of the design. For
example, the Fork/Join template in Fig. 17 will implement a
fork operation; but when mirrored horizontally it implements a
join operation of two handshake paths. Thus a single template
is used for either datapath forking or joining operations.
The datapath in the example contains branches and forks.
These must all be broken in the control path by correctly
inserting the handshake templates to ensure a point-to-point
network connection. These elements must also be inserted in
a way that implements complementary operations; every fork
in the datapath must be associated with a join, and so forth.
V. Results
Twelve different versions of the Verilog example were
synthesized, simulated and evaluated in order to demonstrate
the flexibility and advantages of this tool flow. The different
versions include (i) mapping the design to latches or flops,
(ii) using an incomplete set of constraints, (iii) having various
frequencies for each pipeline stage, and (iv) applying time
borrowing to the latch design. All designs started with the
same behavioral module of Fig. 16 with one exception: the
flop based designs required replacing the latch_active_high
module with a structural flop bank. All designs were
synthesized, physically placed and routed, and simulated using
post-layout parasitics to generate delay and power results.
The reported results used the Artisan library for the IBM
65nm 10sf process using full layout and parasitic extraction.
Design Compiler was used for synthesis, ModelSim was used
for simulation, and SoC Encounter was used for place, route,
and parasitic extraction. The power and delay numbers used
sdf parasitic back annotation into ModelSim. The power
numbers were generated using parasitic extraction and activity
factors from a simulation run by importing a vcd file from
ModelSim into SoC Encounter. The simulation run exhaustively
executed all input values from zero to 256 while also
validating functionality. Post-layout timing was validated using
the full set of constraints, including the DI wire constraints,
using PrimeTime with extracted parasitics.
Two delays are critical in these designs for timing driven
synthesis and place and route: the delay of the combinational
logic and the delay of the control logic to ensure proper
storing of the data. Each of these delays can be independently
set for each pipeline stage. For all comparable designs, the
combinational logic between flops or latches had the same
target delay. However, the delay element between control
logic may be sized differently based on the efficiency of
synthesizing the control logic, as will be shown.
module toy (din, dout, lr, la, rr, ra, rst);
  input lr, ra, rst; output la, rr; input [15:0] din; output [31:0] dout;
  reg [31:0] R0, R10, R11, R2;
  assign dout = R2_q;
  always @(*) R0 = din;
  linear_control tk0 (.ck(ck0), .lr(lr), .la(la), .rr(r0), .ra(a0), .rst(rst));
  latch_active_high R0_reg (.d(R0), .clk(ck0), .q(R0_q));
  bcast_fork bcf0 (.bi(r0), .bo0(r00), .bo1(r01), .ji0(a00), .ji1(a01), .jo(a0));
  always @(*) R10 = R0_q * R0_q;
  linear_control tk10 (.ck(ck10), .lr(r00), .la(a00), .rr(r10), .ra(a10), .rst(rst));
  latch_active_high R10_reg (.d(R10), .clk(ck10), .q(R10_q));
  always @(*) R11 = R0_q * 3;
  linear_control tk11 (.ck(ck11), .lr(r01), .la(a01), .rr(r11), .ra(a11), .rst(rst));
  latch_active_high R11_reg (.d(R11), .clk(ck11), .q(R11_q));
  bcast_fork bcm0 (.bi(a1), .bo0(a10), .bo1(a11), .ji0(r10), .ji1(r11), .jo(r1));
  always @(*) R2 = R10_q + R11_q;
  linear_control tk2 (.ck(ck2), .lr(r1), .la(a1), .rr(rr), .ra(ra), .rst(rst));
  latch_active_high R2_reg (.d(R2), .clk(ck2), .q(R2_q));
endmodule // toy
Fig. 16. Verilog used to synthesize the example pipeline
Fig. 17. Fork/Join Template
TABLE I
Example comparing flop and latch based design with identical
pipeline frequency. The ICS column uses an incomplete
constraint set. Energy reported in pJ per token, clock period
in nsec.
                         Flip-Flops         Latches
                         ICS      FCS       ICS      FCS
Avg. energy (nJ)         0.762    0.493     0.673    0.406
Avg. switching energy    0.673    0.158     0.305    0.169
Avg. internal energy     0.440    0.308     0.343    0.212
Avg. leakage energy      0.031    0.028     0.025    0.025
Area (mm2)               12,724   12,294    11,215   10,770
Datapath clk per.        2.0      2.0       2.0      2.0
Control delay            2.5      2.0       2.0      2.0
Data must be valid before the rising edge of lr into the
control logic for the LC protocol employed. Note that for
efficient operation, a unidirectional delay between rr and lr in
the pipeline is desired, where the rising delay is large and the
falling delay is as small as possible. However, the scripts result
in the clocked CAD automatically generating bidirectional
delays. Unfortunately, bidirectional delays result in over a
100% delay overhead for protocols where data is valid on
the rising edge of lr. Efficient designs must employ different
protocols or unidirectional delays. However, this protocol
works well for our example pipeline because it provides an
ample time borrowing window. Time borrowing in the design
occurs in two forms. First, for the simple design example (see
Fig. 3) the delay through the 16-bit multipliers of the second
pipeline stage is much larger than the 32-bit adder delay in
the final stage. This allows the stages previous to the adder
stage to borrow some of its cycle time. Second, variation
in a design can be mitigated by time borrowing. Latches
are operated in a normally closed mode in the design. This
allows time borrowing to occur based on the delay between
la asserting and deasserting because new data will not be
propagated forward until la lowers (see Fig. 5 and 6).
One of the primary aims of these examples is to evaluate
the effectiveness of timing driven synthesis and place and
route of the asynchronous templates. This is demonstrated by
utilizing an incomplete constraint set (ICS) from the template
characterization, as well as the full constraint set (FCS) for
each version of the design. The incomplete constraint set
utilizes all of the relative-timing generated constraints, but allows
the clocked CAD tools to utilize their internal cycle cutting
algorithms to generate the timing DAGs. Thus, the incomplete
constraint set leaves out of the flow the loop breaking
constraints shown in Fig. 15.
TABLE II
Version with variable pipeline frequencies.
                         Flip-Flops         Latches
                         ICS      FCS       ICS      FCS
Avg. energy (nJ)         0.752    0.492     0.677    0.398
Avg. switching energy    0.285    0.159     0.308    0.167
Avg. internal energy     0.439    0.306     0.349    0.206
Avg. leakage energy      0.028    0.027     0.021    0.025
Area (mm2)               12,878   12,258    11,516   10,887
Datapath clk per.
  multipliers            2.0      2.0       2.0      2.0
  adder                  1.4      1.4       1.4      1.4
Control delay
  multipliers            3.2      2.0       2.0      2.0
  adder                  1.5      1.4       1.4      1.4
TABLE III
Latch based time borrowing versions with and without
variable pipeline frequencies using incomplete and complete
timing path constraints.
                         Fixed frequency    Multi-frequency
                         ICS      FCS       ICS      FCS
Avg. energy (nJ)         0.670    0.378     0.670    0.377
Avg. switching energy    0.309    0.160     0.309    0.158
Avg. internal energy     0.343    0.201     0.343    0.203
Avg. leakage energy      0.017    0.016     0.017    0.017
Area (mm2)               11,264   10,739    11,258   10,937
Datapath clk per.
  multiplier             2.0      2.0       2.0      2.0
  adder                  2.0      2.0       1.1      1.1
Control delay
  multiplier             1.2      1.1       1.2      1.1
  adder                  1.2      1.1       1.1      1.1
Table I shows four designs synthesized to compare the
pipeline using flops versus latches in the datapath. Comparing
the flopped pipeline versus a latch pipeline gives the expected
results: the latch design is more energy efficient (12% and
18% respectively for ICS and FCS) and smaller (approximately
12% for both). The full constraint set designs (FCS) show a large
improvement in power and a minor area reduction. The timing
optimized design resulted in a 35% and 40% reduction in
energy for the flop and latch designs respectively. For the flop
design, there is also a significant improvement in performance,
as the improperly constrained design requires a control delay
25% slower than the datapath to operate properly. Inspecting
the post-layout netlist reveals that the ICS design substantially
oversized many gates. For example, the tools sized an AOI32
gate of Fig. 7 six times larger in the ICS versions as compared
to the FCS versions of the design. This larger gate is energy
inefficient and creates skew in the delay paths that ultimately
results in a 25% slower circuit. However, for the latch design,
the same control target frequency as the FCS version can be
used due to the time borrowing that occurs.
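The percentages quoted for Table I follow directly from the tabulated energy and area values. The short sketch below reproduces that arithmetic; the dictionary layout and the `reduction` helper are illustrative choices, while the numbers are copied from Table I.

```python
# Average energy and area from Table I, keyed by (storage type, constraint set).
TABLE_I = {
    ("flop", "ICS"): {"energy": 0.762, "area": 12724},
    ("flop", "FCS"): {"energy": 0.493, "area": 12294},
    ("latch", "ICS"): {"energy": 0.673, "area": 11215},
    ("latch", "FCS"): {"energy": 0.406, "area": 10770},
}


def reduction(base, new):
    """Percentage reduction of `new` relative to `base`."""
    return 100.0 * (base - new) / base


for cs in ("ICS", "FCS"):
    flop, latch = TABLE_I[("flop", cs)], TABLE_I[("latch", cs)]
    print(f"latch vs flop ({cs}): "
          f"energy {reduction(flop['energy'], latch['energy']):.1f}%, "
          f"area {reduction(flop['area'], latch['area']):.1f}%")

for kind in ("flop", "latch"):
    ics, fcs = TABLE_I[(kind, "ICS")], TABLE_I[(kind, "FCS")]
    print(f"FCS vs ICS ({kind}): "
          f"energy {reduction(ics['energy'], fcs['energy']):.1f}%")
```

Rounded to whole percentages, these give the 12% and 18% latch savings, the roughly 12% area savings, and the 35% and 40% savings of the full constraint set.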
Table II shows four new designs where the pipeline stages
are independently assigned delays to optimize the power-delay
product for each pipeline function. The 16-bit multipliers were
given a target cycle time of 2.0ns, and the 32-bit adder a cycle
time of 1.4ns. This example shows that even with traditional
clocked tools, this characterization flow is able to directly
synthesize and validate multi-frequency pipelined designs.
As in the single frequency case, the full constraint set
results in lower area and power than the incomplete set, as
well as a faster design (ignoring the time borrowing that occurs
for the latched ICS version).
The final four designs show how this flow can be used
to actively exploit time borrowing between pipeline stages
in the clocked CAD. This is achieved without changing the
synthesis scripts. The only change is in assigning different
delay values to the control path. The first two versions of the
design, shown in Table III, use a fixed frequency for all datapath
pipeline stages. The last two versions use different frequencies
for the multiplier and adder stages. The primary difference
between the fixed and multi-frequency designs is that the
multi-frequency design slightly constrains the worst case adder
path, which results in a very small reduction in cycle time and
energy. The most significant observation from these designs is
the ability of time borrowing to mitigate variations in the
design, whether the source is poor frequency or design
optimization (as can be seen by the energy difference of 44%).
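TABLE IV
Data check timing report summary for some RT constraints. Listed
slacks are all worst case.
RT Constraints                                          Setup (ns)   Slack (ns)
$lr\uparrow \mapsto rr\uparrow \prec y\_\downarrow$     0.05         0.16
$lr\uparrow \mapsto la\uparrow \prec y\_\downarrow$     0.05         0.12
$lr\uparrow \mapsto la\uparrow \prec ra\downarrow$      0.00         0.92
$lr\uparrow \mapsto rr\uparrow \prec lr\downarrow$      0.00         0.80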
All relative-timing constraints, including the
delay-insensitive constraints, are used to validate post-layout timing
(using extracted layout parasitics imported as a standard delay
format file) in PrimeTime. The timing report validated that all the
constraints used for timing driven synthesis and place and
route are correct with positive slack. In the latch based pipeline
implementation the multiplication latch stages can use time
borrowing from the next stage. Tables IV and V show a brief
summary of the timing reports.
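The two properties summarized by the timing report of Table V can be checked mechanically: every latch borrows no more than its maximum allowed time, and every control path has positive slack. The sketch below copies the rows of Table V and reads the last column as time borrowed for datapath rows and as slack for control path rows, which is an interpretation of the "TB/Slk" header; the helper names are hypothetical.

```python
# Datapath rows of Table V: (from, to, max constraint, LSup, MxTB, TB), in ns.
DATA_PATHS = [
    ("R0", "R10", 1.70, 0.20, 0.65, 0.25),
    ("R10", "R2", 1.08, 0.17, 0.68, 0.20),
    ("R0", "R11", 1.70, 0.17, 0.68, 0.01),
    ("R11", "R2", 1.08, 0.20, 0.65, 0.20),
]

# Control path rows of Table V: (from, to, min constraint, slack), in ns.
CTRL_PATHS = [
    ("tk0/lr", "tk10/lr", 1.19, 0.12),
    ("tk10/lr", "tk2/lr", 1.08, 0.13),
    ("tk0/lr", "tk11/lr", 1.19, 0.12),
    ("tk11/lr", "tk2/lr", 1.08, 0.11),
]


def borrowing_within_limit(rows):
    """True if every latch borrows between zero and its maximum (MxTB)."""
    return all(0.0 <= tb <= mxtb for (_, _, _, _, mxtb, tb) in rows)


def all_slack_positive(rows):
    """True if every min-delay control constraint is met with margin."""
    return all(slack > 0.0 for (*_, slack) in rows)


print(borrowing_within_limit(DATA_PATHS), all_slack_positive(CTRL_PATHS))
```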
TABLE V
Timing report summary for constraints between pipeline
stages. The latches in the datapath borrow time from the next
stages with LSup (library setup time), MxTB (maximum time
borrowing) and TB (real time borrowing) listed. All the
numbers are in nanoseconds.
PathType   From      To        Constr.     LSup   MxTB   TB/Slk
DataPath   R0        R10       max 1.70    0.20   0.65   0.25
DataPath   R10       R2        max 1.08    0.17   0.68   0.20
DataPath   R0        R11       max 1.70    0.17   0.68   0.01
DataPath   R11       R2        max 1.08    0.20   0.65   0.20
CtrlPath   tk0/lr    tk10/lr   min 1.19    N/A    N/A    0.12
CtrlPath   tk10/lr   tk2/lr    min 1.08    N/A    N/A    0.13
CtrlPath   tk0/lr    tk11/lr   min 1.19    N/A    N/A    0.12
CtrlPath   tk11/lr   tk2/lr    min 1.08    N/A    N/A    0.11
VI. Conclusions
This paper shows how asynchronous Verilog behavioral
designs can be characterized in a way that allows them to be
synthesized, optimized, and validated using traditional clocked
tool flows. This methodology requires the asynchronous blocks
to be designed as precharacterized templates that are
structurally inserted into the behavioral design at each pipeline
stage. The characterization methodology is based on formal
verification and relative timing to generate several sets
of constraints ranging from key timing driven
speed-independent constraints to a complete set of delay-insensitive
constraints. The full constraint generation flow was
demonstrated for a linear pipeline controller cell.
A simple design was used to demonstrate the functionality
of the design flow and show how different versions can
easily be generated by modifying timing constraints. Twelve
different versions of the behavioral design were synthesized
and evaluated in IBM's 65nm 10sf process. These designs
demonstrated the performance and power benefits of this
flow, as the complete constraint set showed up to a 44%
reduction in power compared to one that allowed automatic
cycle cutting. The tools were used to automatically synthesize
designs mapped to flops, latches, variable frequency pipelines,
and time borrowing designs. The benefit of a latch based
design was demonstrated, showing up to a 12% area reduction
and 19% reduction in energy over the flop based version.
Variable pipeline frequency did not substantially change the
performance, power, or area of this linear fork/join pipeline.
Time borrowing was able to substantially mitigate variations in
the controller, and reduce the performance constraining cycle
time by up to 45%, and reduce the energy by up to 5% over
the fixed frequency latch based design.
The flow presented here opens the capability for any clocked
designer to create handshake clocked asynchronous designs
using asynchronous templates characterized with this flow.
As such, this is an important first step to achieving the
evolutionary integration of asynchronous handshake clocking
into 20% of the semiconductors by 2012 as predicted by the
ITRS.
VII. Acknowledgments
We would like to recognize the generous funding from SRC
and NSF who supported this research, ARM for providing the
65nm library, and MOSIS and IBM for the foundry information.
References
[1] Nikolas Andrikos, Luciano Lavagno, Davide Pandini, and Christos P. 
Sotiriou. A Fully-Automated Desynchronization Flow for Synchronous 
Circuits. In Design Automation Conference, pages 982-985. ACM/IEEE, 
June 2007.
[2] I. Blunno, J. Cortadella, A. Kondratyev, L. Lavagno, K. Lwin, and
C. Sotiriou. Handshake protocols for de-synchronization. In International
Symposium on Asynchronous Circuits and Systems, pages 149-158.
IEEE, Apr 2004.
[3] Tam-Anh Chu. Synthesis of Self-Timed VLSI Circuits From
Graph-Theoretic Specifications. PhD thesis, Massachusetts Institute of
Technology, September 1987.
[4] Jordi Cortadella, Alex Kondratyev, Luciano Lavagno, and Christos P. 
Sotiriou. Desynchronization: Synthesis of asynchronous circuits from 
synchronous specifications. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 25(10):1904-1921, Oct 2006.
[5] Karl M. Fant and Scott A. Brandt. NULL Convention Logic: A Complete 
and Consistent Logic for Asynchronous Digital Circuit Synthesis. In
International Conference on Application-Specific Systems, Architectures, 
and Processors, pages 261-273, 1996.
[6] TiDE White Paper, Handshake Solutions. V 1.0, June 2007.
[7] Hoshik Kim, Peter A. Beerel, and Kenneth S. Stevens. Relative timing 
based verification of timed circuits and systems. In 8th International 
Symposium on Asynchronous Circuits and Systems, pages 115-126. 
IEEE Press, April 2002.
[8] Alex Kondratyev and Kelvin Lwin. Design of Asynchronous Circuits 
Using Synchronous CAD Tools. IEEE Design & Test of Computers, 
19(4):107-117, July-Aug. 2002.
[9] Michiel Ligthart, Karl Fant, Ross Smith, Alexander Taubin, and Alex 
Kondratyev. Asynchronous Design Using Commercial HDL Synthesis 
Tools. In International Symposium on Advanced Research in
Asynchronous Circuits and Systems, pages 114-125. IEEE, Apr 2000.
[10] Alain J. Martin. Practical asynchronous circuits and tools. IEEE Design 
& Test of Computers, 19(4):108, July-Aug. 2002.
[11] Robin Milner. Communication and Concurrency. Computer Science. 
Prentice Hall International, London, 1989.
[12] Ad Peeters and Kees van Berkel. Synchronous handshake circuits. 
In Seventh International Symposium on Asynchronous Circuits and 
Systems, pages 86-95. IEEE, Mar 2001.
[13] Bradley R. Quinton, Mark R. Greenstreet, and Steven J.E. Wilton. 
Asynchronous IC Interconnect Network Design and Implementation 
Using a Standard ASIC Flow. In International Conference on Computer
Design: VLSI in Computers and Processors, pages 267-274. IEEE, Oct 
2005.
[14] Oriol Roig, Jordi Cortadella, and Marco A. Pena. Automatic generation 
of synchronous test patterns for asynchronous circuits. In Design 
Automation Conference, pages 620-628. ACM/IEEE, June 1997.
[15] S. Rotem, K. Stevens, R. Ginosar, P. Beerel, C. Myers, K. Yun, R. Kol,
C. Dike, M. Roncken, and B. Agapiev. RAPPID: An Asynchronous
Instruction Length Decoder. In 5th International Symposium on Advanced
Research in Asynchronous Circuits and Systems, pages 60-70. IEEE,
April 1999. Best paper award.
[16] Semiconductor Industry Association. The International Technology
Roadmap for Semiconductors, 2005 edition, 2005.
http://www.itrs.net/links/2005itrs/design2005.pdf.
[17] Sanjit A. Seshia, Randall E. Bryant, and Kenneth S. Stevens.
Modeling and verifying circuits using generalized relative timing. In 11th
International Symposium on Asynchronous Circuits and Systems, pages 
98-108, March 2005.
[18] Ken Stevens, Shai Rotem, Ran Ginosar, Peter Beerel, Chris Myers, 
Kenneth Yun, Rakefet Kol, Charles Dike, and Marly Roncken. An 
Asynchronous Instruction Length Decoder. IEEE Journal of Solid State 
Circuits, 36(2):217-228, February 2001.
[19] Kenneth S. Stevens. Practical Verification and Synthesis of Low Latency 
Asynchronous Systems. PhD thesis, University of Calgary, Calgary, 
Alberta, Canada, September 1994.
[20] Kenneth S. Stevens, Ran Ginosar, and Shai Rotem. Relative Timing. 
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 
11(1):129-140, February 2003.
[21] Kenneth S. Stevens, Shai Rotem, Steven M. Burns, Jordi Cortadella, Ran 
Ginosar, Michael Kishinevsky, and Marly Roncken. CAD Directions for 
High Performance Asynchronous Circuits. In Proceedings of the Design
Automation Conference (DAC99), pages 116-121. IEEE, June 1999.
[22] Frank te Beest, Ad Peeters, Kees van Berkel, and Hans Kerkhoff. 
Synchronous full-scan for asynchronous handshake circuits. Journal 
of Electronic Testing, 19(4):397-406, Aug 2003.
[23] Kees van Berkel, Ronan Burgess, Joep L. W. Kessels, Ad Peeters, Marly 
Roncken, and Frits Schalij. A Fully Asynchronous Low-Power Error 
Corrector for the DCC Player. IEEE Journal of Solid-State Circuits, 
29(12):1429-1439, Dec 1994.
[24] Ted E. Williams and Mark A. Horowitz. A 160ns 54bit CMOS division 
implementation using self-timing and symmetrically overlapped SRT 
stages. In Peter Kornerup and David W. Matula, editors, Proceedings 
of the 10th IEEE Symposium on Computer Arithmetic, pages 210-217, 
1991.
