Relative timing by Stevens, Kenneth & Ginosar, Ran
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 11, NO. 1, FEBRUARY 2003 129
Relative Timing
Kenneth S. Stevens, Senior Member, IEEE, Ran Ginosar, Member, IEEE, and Shai Rotem
Abstract: Relative timing (RT) is introduced as a method for asynchronous design. Timing requirements of a circuit are made explicit using relative timing. Timing can be directly added, removed, and optimized using this style. RT synthesis and verification are demonstrated on three example circuits, facilitating transformations from speed-independent circuits to burst-mode and pulse-mode circuits. Relative timing enables improved performance, area, power, and functional testability of up to a factor of 3× in all three cases. This method is the foundation of optimized timed circuit designs used in an industrial test chip, and may be formalized and automated.
Index Terms: Asynchronous design, dynamic logic circuit, high performance, low-power design, performance tradeoffs.
I. INTRODUCTION
THE design of RAPPID, the asynchronous instruction length decoder, took more than two years to complete [1]. The primary goal was to investigate whether asynchronous design could improve performance in high-end microprocessors. This naturally led to the effort, reported in this paper, to study and develop circuits, computer-aided design (CAD), and methodology most suitable for aggressive timed asynchronous circuit design.
Initial designs and methods were based on the CAD available at that time. The circuits were specified and synthesized using speed-independent (SI) or burst-mode (BM/XBM) methodologies [2]-[4], as well as metric timed circuit design [5]. We quickly discovered that many of the circuits that achieved our performance goals contained some form of timing assumptions: either the fundamental mode assumption of burst-mode or gate-level metric timing. The performance was improved by studying the natural delays of the circuits to employ timing that simplified the designs by reducing series transistors and logic levels.
Unfortunately, all the asynchronous methodologies at that time had what we considered an impediment to conceptualizing, optimizing, validating, and interfacing timed circuits. The timing assumptions were all implicit. We felt that in many cases, the key performance was achieved through careful management and design of the timing of the circuits as much as the behavior. We therefore studied ways to make the timing of circuits explicit. This effort resulted in the relative timing (RT) style reported here.
Manuscript received March 5, 2001; revised June 29, 2001 and January 7, 
2002.
K. S. Stevens and S. Rotem are with Strategic CAD Labs, Intel Corporation, 
Hillsboro, OR 97124 USA.
R. Ginosar is with the VLSI Systems Research Center, The Technion, Haifa, 
Israel.
Digital Object Identifier 10.1109/TVLSI.2002.801606
Relative timing proved to be a very effective method of substituting aggressive pulse-mode self-resetting circuits for the original full-handshake speed-independent designs in RAPPID. This novel method also allowed us to design and verify speculative asynchronous state machines. However, this effort required a new way of thinking about asynchronous designs and required a new set of tools.
In the absence of RT CAD tools, the manual flow is quite inefficient for the design of large systems. Now we face the question of how our manual method can be formalized into an effective CAD methodology and tools. We propose that new formal methodologies and tools be developed to support this method. This paper presents our methodology and lessons in order to motivate further CAD development. We start with simple, contrived examples that demonstrate basic principles, and move to a key RAPPID circuit that has been improved substantially with relative timing.
II. MOTIVATION AND DESCRIPTION
The design of timing in digital circuits is an extremely difficult challenge. The conventional clocked digital design methodology solves this problem by decomposing the circuit into cycle-free combinational logic (CL) stages and interstage clocked latches; the clock cycle is simply tuned to accommodate the worst case propagation delay in the CL stages. The behavior of the combinational logic can be specified and synthesized without considering timing. Delay-insensitive (DI) asynchronous circuits are analogous to clocked CL design in the sense that both types are independent of time: the behavior will be correct for arbitrary gate and wire delay.
High-performance circuits, both clocked and asynchronous, benefit from more aggressive timing methodologies. Clocked circuits can be considerably enhanced using local self-timing [6]-[8]. Timed asynchronous circuits can also have significantly enhanced performance.
Asynchronous design consists of handshake protocols that ensure the validity of data [9], [10]. Asynchronous design methodologies, apart from DI, make timing assumptions in the protocols, function logic, or data transmission [11]. If the assumptions are invalid in the physical implementation, the circuits can glitch and fail to operate correctly. SI circuits assume indistinguishable skew on wire forks, burst mode assumes fundamental mode (the circuit will stabilize internally before new inputs arrive), and bundled data assumes that all data is stable before the handshake signal arrives. Ensuring that the timing assumptions hold in timed design, such as burst mode, can be challenging [12].
The design style we investigated explicitly specifies the effect
of delays in a circuit in terms of assertions on relative ordering
of events (e.g., $a$ goes high before $b$ goes low). Our application of relative timing is based on the unbounded delay model commonly used by many asynchronous synthesis and verification tools. The circuits are then designed to meet the relative orderings, and the constraints are validated as being part of the natural delays in the system.
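An ordering assertion of this kind can be checked mechanically against a recorded event trace. The following is a minimal sketch, not the authors' tooling: the function name `precedes` and the `+`/`-` suffixes for rising and falling transitions are illustrative assumptions, and each event is assumed to occur at most once per trace.

```python
def precedes(trace, first, second):
    """Check the ordering assertion `first` before `second` on one trace.

    Assumes each event appears at most once in `trace` (a single cycle).
    The assertion holds vacuously when `second` never occurs.
    """
    if second not in trace:
        return True
    if first not in trace:
        return False
    return trace.index(first) < trace.index(second)


# "a goes high before b goes low", written here as a+ before b-.
print(precedes(["a+", "b+", "z+", "b-"], "a+", "b-"))  # True
print(precedes(["b-", "a+"], "a+", "b-"))              # False
```

A check like this only tests one observed ordering; it says nothing about whether the ordering is guaranteed by the delays of the physical circuit.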
A number of benefits emerged from making RT constraints explicit in our designs. Timing relationships are no longer hidden by a design style or tool. RT can unify the asynchronous methodologies as well as provide support for ad hoc manual designs. The bundled and burst-mode assumptions, for example, can usually be made explicit with a small number of RT constraints, as shown in Section IV-B4. The explicit nature of the constraints can simplify interfacing, synthesis, and performance verification. RT is not restricted to any particular specification style and supports arbitrary designs. Since timing can directly affect the quality and robustness of the circuits, each assumption can be individually evaluated, and its application can be aggressive or conservative.
Many timing CAD tools and methodologies exist; asynchronous design itself is a timing methodology. Ordering signals temporally is not novel. This ordering can be achieved through graph transformations that reduce concurrency similar to the theory developed by Vanbekbergen [13]. Timed Petri nets, timed finite-state machines, and other bounded-delay formalisms have been used to reason about timed circuits in [14]-[20]. Component databooks include waveforms showing relative signal orderings, and orderings have been applied to micropipeline latches and controllers [21]-[23]. These methodologies can achieve extremely efficient circuits; indeed, the tag unit in RAPPID, used as the primary example in this paper, was first specified, synthesized, and validated using the metric tool ATACS [24].
However, we do feel that the RT methodology used in RAPPID applies timing top-down in a novel way that is intuitive and flexible, creating compact, testable, high-performance low-power circuits in a style that can be automated by CAD. Further, this methodology supports both automatic and user-specified timing transformations. Initial RT solutions based on this work applied to synthesis [20] and verification [25] show remarkable results and potential for an automated RT design flow.
III. RAPPID RELATIVE TIMING DESIGN
Relative timing had a significant impact on the RAPPID results. The timed asynchronous circuits, when compared to similar clocked logic in a commercial synchronous implementation, showed a 3× improvement in throughput, a 2× improvement in latency, and half the energy per operation, at a 20% area penalty [1]. Although harder to quantify, we feel that relative timing was also key in achieving the 95% stuck-at testability in RAPPID with our functional built-in self-test method through removing redundancies that naturally result through fixed signal orderings induced by timing.
Most of the RT circuits in RAPPID were designed by hand. The RT transformations modified many behavioral aspects of the specifications, concurrency in particular. However, the essential functionality of the controllers (synchronization and ordering) remained. This effort, while time consuming, helped us better understand timing, timed technology mapping, and what types of transformations appeared most beneficial. Various forms of handshaking were investigated, including protocols without direct acknowledgment. These pulse-based protocols can at times significantly improve the simplicity and latency of asynchronous circuits.
Most of our implementations were mapped onto standard static and domino library cells. Domino circuits are a restricted class of generalized C-elements [26], where only a single term exists in the reset function. The combination of state-holding, low transition latency, and low activity factor of the domino gates made them the best circuit alternative we investigated.
Section IV describes the method we developed for designing and optimizing relative-timed circuits. What began as a number of circuit experiments evolved into a manual flow. Automated tool support for these flows was painfully lacking, so we began mentoring development of RT CAD. Early engagement with the Petrify team led to automation of synthesis using relative timing [20]. Verification using RT constraints was added to the verification tool Analyze [27] in-house. This tool was used to optimize the constraints in a slow, error-prone manual loop. Theory automating the verification and RT constraint optimization is under development [25]. We encourage researchers to further formalize and develop new CAD for automating RT design.
IV. EXAMPLES
A. Notation and Terminology
Table I shows some notations used in this paper. The process logic CCS [28] is used in this paper, where “.” is the sequential operator, “+” is the nondeterministic choice operator, “|” is parallel composition, and “\{a}” is the restriction operator applied to signal $a$, which disallows independent $a$ and $\bar{a}$ transitions. Restricting signal $a$ only permits the internal $\tau$ synchronization of the “handshake” between $a$ and $\bar{a}$.
All simulations have been made using synchronous standard library cells in a 0.18-μm process. The output of each circuit drives a 0.18 × 25 μm gate load. The circuits are simulated using SPICE and the values normalized against one of the circuits in terms of area and energy. A more complete modeling of some of these circuits and parameters can be found in [29].
The circuit examples in this paper contain static and domino gates normally employing a single pMOS device. Asynchronous tools such as ATACS [5], 3D [4], [30], and Petrify [2] can typically synthesize set-reset flops and the appropriate functions [Fig. 1(a)]. We can often apply technology mapping into single-variable reset (equivalently set) functions and implement them using standard footed domino gates as in Fig. 1(b). When the reset variable is not used in the set function, an unfooted domino gate is used instead [Fig. 1(c)].
B. C-Element
We use a simple C-element example to demonstrate the concepts and methods of applying relative timing to synthesis and verification. A simple two-input generalized C-element and its CMOS implementation are shown in Fig. 2(a). The formal definition in CCS is $C = (a \mid b).z.C$, which reads “$C$ is defined as single transitions showing on inputs $a$, $b$ in parallel (at any order), followed by a transition on the output $z$, then followed recursively by $C$ again” [28]. An equivalent signal transition graph (STG) [31], [32] representation of the specification is shown in Fig. 3(a).

TABLE I
input signal               underline     $\underline{input}$
output signal                            $output$
inverted (asserted low)    over-bar      $\bar{z}$
rising transition          up arrow      $a\uparrow$
falling transition         down arrow    $b\downarrow$
timing arc                 dashed arc    - - - ->

Fig. 1. (a) Set-reset flop and functions. (b) Footed domino gate (symbol and circuit) implementing a set-reset flop with $f_r = \bar{x}$, $f_s = x \cdot a \cdot (b + c)$. (c) Unfooted domino gate implementing $f_r = \bar{x}$, $f_s = a \cdot (b + c)$.
Fig. 2. Generalized C-elements: (a) GC, (b) GC-RT for $a\downarrow \prec b\downarrow$, and (c) for $a\uparrow \prec b\uparrow$.
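The C-element behavior described above can be read operationally: the output changes only after both inputs have changed, and holds otherwise. The following is a minimal sketch under that reading; the function name `c_element` and the level-based (rather than transition-based) model are illustrative assumptions, not part of the paper.

```python
def c_element(a: int, b: int, z_prev: int) -> int:
    """Next output of a two-input C-element (illustrative level model).

    The output follows the inputs when they agree and holds otherwise.
    """
    if a == b:
        return a
    return z_prev


# One full cycle: a rises, b rises (z rises), a falls, b falls (z falls).
z = 0
trace = []
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    z = c_element(a, b, z)
    trace.append(z)
print(trace)  # [0, 1, 1, 0]
```

The input order within each pair of transitions does not matter in this model, which corresponds to the parallel composition of the two inputs in the CCS definition.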
1) Relative Timing Synthesis: RT synthesis optimizes a circuit by adding timing arcs to a behavioral specification. Both timing and causality affect the behavior of an RT circuit. Behavioral arcs must be synthesized into gates, and timing relations enforce a specific ordering between concurrent events, resulting in concurrency reduction in the specification.
Relative timing assumptions come in two forms: local and global. Local timing constraints can automatically be generated by moving behavioral arcs based on various assumptions such as lazy transition systems [20]. Global assumptions are dictated by the response of the environment. These assumptions can be applied manually, as in Section V-C, or automatically, as in the burst-mode assumption that a circuit will stabilize before a new input burst arrives [4], [30], [33].
RT synthesis supports the creation and strengthening of timing assumptions by moving the relative positions of the heads and tails of arcs in a specification. If timing arcs are restricted to relative translations of behavioral arcs, aggressive timing optimizations can be performed on a circuit while ensuring a consistent, compatible result. The new specification can now be synthesized, and timing assumptions and requirements can be back-annotated. In this section, we show some simple, intuitive transformations on C-elements. In Section V, we show aggressive application of relative timing in a large circuit.
Fig. 3. Relative timing transformations on the Petri-net of a C-element: (a) initial spec, (b) relative timing arc RTA2 $a\downarrow \prec b\downarrow$ effectively “translates” arc $(a\downarrow, z\downarrow)$ to $(a\downarrow, b\downarrow)$, and (c) new spec, × arc is redundant.
2) Synthesis Examples: Assume that the environment always produces transitions on $a$ before transitions on $b$. This relative timing assumption is expressed as follows:
RTA1: $a \prec b$.
The C-element can be reduced to a buffer $C = b.z.C$ using this assumption. Fig. 3(b) shows the STG when the assumption is limited to the falling edges
RTA2: $a\downarrow \prec b\downarrow$.
Fig. 4. Static C-elements: (a) speed-independent, (b) with RT assumption $a\downarrow \prec b\downarrow$, (c) with RT assumption $a\uparrow \prec b\uparrow$, and (d) burst-mode C-element with hazards.
The dashed arc represents the timing assumption RTA2. Note that the timing arc supersedes the behavioral arc from $a\downarrow$ to $z\downarrow$. Relative timing effectively moves the tail of this arc from one event ($z\downarrow$) to a predecessor of the event ($b\downarrow$), as indicated by the double arrow in Fig. 3(b). The new timing arc makes the behavioral arc redundant, as shown in Fig. 3(c). In the corresponding circuit, the reset function contains only $b\downarrow$, and the C-element can be implemented as the Fig. 2(b) footed domino gate GC-RT
$C = (a\uparrow \mid b\uparrow).z\uparrow.a\downarrow.b\downarrow.z\downarrow.C$.
Given a similar assumption on the positive edges
RTA3: $a\uparrow \prec b\uparrow$
the circuit can be mapped to the domino gate in Fig. 2(c) by inverting the inputs and employing the nonbuffered $z$ output.
Static C-element implementations can be synthesized with Petrify. The STG of Fig. 3(a) produces the speed-independent circuit (SIC) shown in Fig. 4(a). Timing assumptions RTA2 and RTA3 lead to the simpler static circuits of Fig. 4(b) and (c), respectively. Note that these two circuits are actually subcircuits of the speed-independent circuit. 3D synthesizes the circuit of Fig. 4(d).
In general, applying relative timing for synthesis means that new (timing) arcs are inserted, rendering other arcs redundant. This could also be considered as moving the head, tail, or both ends of behavioral arcs to predecessors. This effectively reduces concurrency in the specification, allowing a simpler implementation by removing transistors and gates.
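The concurrency reduction can be made concrete by counting the event orderings that a specification allows. The sketch below enumerates the interleavings of one C-element cycle under its causal arcs, then adds timing arcs for the falling-edge and rising-edge assumptions. The event names (`a+` for a rising transition, `a-` for a falling one) and the brute-force enumeration are illustrative assumptions, not the synthesis method used in the paper.

```python
from itertools import permutations

EVENTS = ["a+", "b+", "z+", "a-", "b-", "z-"]

# Causal (behavioral) arcs of one C-element cycle, as (before, after) pairs.
CAUSAL = [("a+", "z+"), ("b+", "z+"), ("z+", "a-"), ("z+", "b-"),
          ("a-", "z-"), ("b-", "z-")]


def traces(orderings):
    """Return every ordering of EVENTS that respects all (before, after) pairs."""
    result = []
    for perm in permutations(EVENTS):
        pos = {event: i for i, event in enumerate(perm)}
        if all(pos[x] < pos[y] for x, y in orderings):
            result.append(perm)
    return result


untimed = traces(CAUSAL)
falling = traces(CAUSAL + [("a-", "b-")])                 # a- before b-
both = traces(CAUSAL + [("a-", "b-"), ("a+", "b+")])      # plus a+ before b+
print(len(untimed), len(falling), len(both))  # 4 2 1
```

Each added timing arc halves the number of allowed orderings in this small example, which is the reduction in concurrency that lets the implementation drop transistors.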
3) Relative Timing Verification: This section introduces the method developed to verify a large, relative-timed asynchronous circuit called RAPPID [1]. An implementation $I$ conforms to a specification $S$ ($I \preceq S$) when the implementation is an acceptable construction of the specification [16], [27], [34]. In this section, implementations can be assumed to be parallel compositions of the untimed behavioral specifications of the gates. Relative timing predicates can be added to implementations and specifications to reduce their concurrency by pruning states in a state graph (SG) that are unreachable due to timing. Thus, an implementation $I$ with RT predicate $R$ conforms to a specification $S$ when $I \wedge R \preceq S$.
Early in this effort, the Analyze verifier was enhanced to support RT predicates on both implementations and specifications. Circuits can then be verified using SI and DI unbounded delays with RT constraints.
RT verification has two aspects. First, RT constraints reduce concurrency in the implementation by disallowing transitions to failure states. Second, the set of RT constraints is optimized and merged through a set of transformations.
The following algorithm was applied to generate RT constraints and verify RAPPID and the circuit examples in this paper. Step 1) generates RT constraints that remove a single failure state, as will be shown in the following example. This capability was added to our verifier. Step 2) optimizes the constraint by reducing additional concurrency beyond the single failure state. Step 3) adds the new optimized RT constraint to the set, removing any constraints covered by the new constraint. Steps 2) and 3) were done manually.
1) Verify conformance using current RT predicates.
   • If failure free, report RT constraints.
   • If failure cannot be fixed through timing, quit.
   • If failure exists, create RT constraint(s) that remove this failure.
2) Optimize new constraint(s).
   • Remove concurrency by increasing coverage of the SG by the RT constraint.
   • Iterate optimization, terminating when:
     i) further concurrency reduction would remove states required by the specification;
     ii) slack in constraint is no longer positive;
     iii) an arc edge touches a primary input or output.
3) Add optimized constraint to RT constraint set, remove covered constraints, and iterate.
Section IV-B4 illustrates the procedure used in RAPPID to determine how and when to increase coverage of an RT constraint.
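Step 1) of the algorithm can be sketched on a toy state graph. The code below is an illustration only and is not the Analyze verifier: the graph `TOY`, the state names, the function names, and the choice of the first safe sibling arc as the left side of a constraint are all hypothetical. It finds an arc into a failure state, pairs it with a safe arc leaving the same state to form a constraint of the form (first, later), blocks the later arc, and repeats until no failure is reachable. Steps 2) and 3), the manual optimization and merging, are not modeled.

```python
from collections import deque


def reachable(graph, start, blocked):
    """States reachable from `start` without taking any blocked (state, label) arc."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for label, target in graph.get(state, []):
            if (state, label) in blocked or target in seen:
                continue
            seen.add(target)
            queue.append(target)
    return seen


def derive_constraints(graph, start, failures):
    """Return (first, later) orderings that make every failure state unreachable."""
    constraints = []
    blocked = set()
    while True:
        states = reachable(graph, start, blocked)
        bad = sorted(
            (s, label)
            for s in states
            for label, target in graph.get(s, [])
            if target in failures and (s, label) not in blocked
        )
        if not bad:
            return constraints          # failure free: report constraints
        state, late = bad[0]
        safe = sorted(l for l, t in graph[state] if t not in failures)
        if not safe:
            raise ValueError(f"failure via {late!r} cannot be fixed through timing")
        constraints.append((safe[0], late))
        blocked.add((state, late))      # the later arc is never taken


# Hypothetical three-state graph; "ERR" marks a failure state.
TOY = {
    "s0": [("z+", "s1")],
    "s1": [("az+", "s2"), ("ab-", "ERR")],
    "s2": [],
}
print(derive_constraints(TOY, "s0", {"ERR"}))  # [('az+', 'ab-')]
```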
4) Verification Example: Consider the static C-element (SC) in Fig. 4(d). This circuit is implicitly hazard-free under burst-mode or fundamental-mode assumptions. However, it is not hazard-free in a speed-independent environment. If the environment responds quickly, $b\downarrow$ may immediately follow $z\uparrow$ before node $az$ rises, resulting in a hazard.
Fig. 5 shows a state graph of the SC C-element circuit. The “bottom” symbols in the left and right corner of the diamonds label error states. Transitions $ab\downarrow_1$ and $bz\downarrow_1$ lead to the error state on the right.
Using the method described in Section IV-B3, we first try to eliminate the right error. Verification will identify any arcs that lead to error states. The following two constraints eliminate the right error state:
RTC4: $az\uparrow_4 \prec ab\downarrow_1$
RTC5: $az\uparrow \prec bz\downarrow_1$.
If one signal must precede another and both exit from a single state, then the later arc will never be taken (e.g., $ab\downarrow_1$ in RTC4). RTC4 and RTC5 therefore disallow entrance to the right error state.
Fig. 5. Relative timed burst-mode SC state graph.
One representation of the timed precedence of RTC4 is the dashed arc between $az\uparrow_4$ and $ab\downarrow_1$ in the SG of Fig. 5. We now try to strengthen this constraint to cover more of the graph. While there may be many methods to optimize the instance-based constraints, our hand methodology used two main iterations.
First, instance information is removed from the constraints when possible. The generalized constraint $az\uparrow \prec ab\downarrow_1$ is equivalent to RTC4, as it adds no new timing arcs to the SG. Generalizing the right side as well results in the constraint $az\uparrow \prec ab\downarrow$, effectively adding a second timing arc $az\uparrow_2 \prec ab\downarrow_0$ to the SG. This constraint now removes two states from the graph: the error state and the state of RTC5. Hence RTC5 is covered by the optimized RTC4 constraint.
Second, the generalized RT constraint can be strengthened based on slack calculations.1 The constraint is strengthened by moving the left and/or right transition to earlier transitions in the SG. Hence the right side of constraint RTC4 can be strengthened to transition $b\downarrow$ and $bz\uparrow$. A simple unit delay model can be used to calculate slack, where local gates are assigned a single delay and input transitions are assigned a value $k$, where $k > 1$. The following example illustrates the strengthening of $az\uparrow \prec ab\downarrow$ starting from the common signal $z\uparrow$:

Constraint paths                                                                        Slack
$z\uparrow\, az\uparrow \prec z\uparrow\, \{bz\uparrow\; b\downarrow\}\, ab\downarrow$   $k$
$z\uparrow\, az\uparrow \prec z\uparrow\, b\downarrow$                                   $k - 1$
$z\uparrow\, az\uparrow \prec z\uparrow\, bz\uparrow$                                    $0$

This indicates that if $k > 1$, the best strengthening is RTC6; otherwise, the weaker $az\uparrow \prec ab\downarrow$ should be used. Applying the same method to the left error state generates RTC7
RTC6: $az\uparrow \prec b\downarrow$
RTC7: $bz\uparrow \prec a\downarrow$.
The RT implementation now conforms to the specification. Precisely, $SC \wedge RTC6 \wedge RTC7 \preceq C = (a \mid b).z.C$. All signals in these constraints are either primary inputs or directly enabled by the primary output, simplifying hierarchical validation.

1Slack is the difference between the latest arrival of the first signal and the earliest arrival of the second.

TABLE II
Comparison of C-Element Implementations. Fall, Rise, and Energy Columns Use Worst SIC Performance as Base, With All Other Numbers a Multiple of That Delay or Energy. Energy Is Average for a Complete Cycle (Rise and Fall). Test Columns Show COSMOS Stuck-At Fault Coverage on All Fanouts, With Reduced Patterns in

Circuit  SI   Fall  Rise  Energy  Area  Test  Test (RTA2)
SIC      Yes  1.00  0.98  1.00    16    100%  90%
SIC-RT   No   0.59  0.52  0.74    8     n/a   100%
SC       No   0.56  0.51  2.03    18    100%  92%
GC       Yes  0.77  0.55  0.86    10    100%  100%
GC-RT    No   0.52  0.52  0.72    9     n/a   100%
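The slack values in the strengthening example follow directly from the unit delay model. The sketch below reproduces them under stated assumptions: time is measured from the common signal (the rising output) at t = 0, each local gate contributes one delay, the environment responds after k, and slack is taken as the arrival of the second signal minus the arrival of the first, so that positive slack means the ordering holds. The function name `slacks` and the `+`/`-` event suffixes are illustrative.

```python
def slacks(k: float, gate: float = 1.0) -> dict:
    """Slack of az+ against three candidate right-hand transitions.

    Unit delay model, measured from z+ at t = 0 (an assumed time origin).
    """
    t_az_rise = gate        # az+ fires one gate delay after z+
    t_bz_rise = gate        # bz+ fires one gate delay after z+
    t_b_fall = k            # environment lowers b after delay k
    t_ab_fall = k + gate    # ab- fires one gate delay after b-
    return {
        "ab-": t_ab_fall - t_az_rise,   # k
        "b-": t_b_fall - t_az_rise,     # k - 1
        "bz+": t_bz_rise - t_az_rise,   # 0
    }


print(slacks(3))  # {'ab-': 3.0, 'b-': 2.0, 'bz+': 0.0}
```

With k greater than 1 the constraint against `b-` still has positive slack, while the constraint against `bz+` has none, which matches the choice of the strengthening that stops at the input transition.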
In general, RT verification allows one to manipulate the initial constraints to arrive at a minimal set of constraints that are easiest to verify in a hierarchical system. Constraints that have overaggressively reduced the slack can be weakened back to the original failure state. If any initial constraint contains unachievable timing, then the circuit is an invalid implementation of the specification.
RTC6 and RTC7 implement a “weak” form of the fundamental-mode requirement of burst mode. Because Analyze uses bisimulation semantics, hazardous behavior inside a circuit that does not propagate to the outputs is permissible due to the observational equivalence property [27], [28]. (This is not the case when using verifiers based on weaker formalisms.) RTC6 and RTC7 prune arcs $a\downarrow_{1,3,4}$ and $b\downarrow_{1,3,4}$ but transitions $a\downarrow_2$ and $b\downarrow_2$ of Fig. 5 remain. Given RTC6 and RTC7, if $a\downarrow_2$ occurs, $az\uparrow$ will either glitch or not fire. This does not create an observable failure because signals $bz$ and $ab$ are asserted holding $z$ high, and the output will not lower until $b\downarrow$ and $bz\downarrow$ occur. Hence the additional “strong” burst-mode RT constraints $az\uparrow \prec a\downarrow$ and $bz\uparrow \prec b\downarrow$ are unnecessary.
5) C-Element Summary: Table II summarizes the five alternative designs from Figs. 2 and 4. The circuits are all sized near the optimal power/performance point. All designs were simulated to drive the same load. If a circuit is hazard-free in an SI environment, then no timing is required for correct operation. The SIC is slower than all others. Applying the RTA2 assumption to this design leads to a circuit (SIC-RT) that is half the







Fig. 6. FIFO block diagram containing three cells.
static SC requires the largest circuit and is fast, but doubles the power. The reduced domino C-element (GC-RT) improves fall times over the GC circuit by 50% (due to simplification of the pullup stack) and rise times by 5%.
Static circuits tend to expend more energy than domino circuits. This is largely due to the extra switching activity in the static designs, as can be observed by the SC circuit, which expends twice the power of the SIC circuit because all four gates toggle for every output transition. When activity factors are similar, the domino circuits have a slight edge. The GC-RT circuit uses only 3% less energy than the static SIC-RT circuit because the reduced device sizes in the domino gates are offset by the short circuit current when the gates switch. Testability was measured in COSMOS using a functional test methodology, where only valid timed signal orderings allowed by the environment can be supplied to the circuit. The table shows that the static and SI circuits are fully testable for complete patterns, but not when timing constraints reduce signal interleavings (in column RTA2). The RT optimized versions of these circuits are fully testable.
V. TIMING EVOLUTION IN A RING
In this section, we trace the development of a simple first-in-first-out (FIFO) controller, similar to a micropipeline [35]. These controllers can be connected in series as shown in Fig. 6. This circuit is a simplified abstraction of a part of the RAPPID design [1] and closely follows the actual steps used to derive the final circuit. We begin with a speed-independent design and review a succession of progressively simpler circuits, enabled through careful application of relative timing assumptions.
A. Speed-Independent FIFO Cell
A simple FIFO cell can be specified in CCS as follows:
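$\mathrm{LEFT} = li\uparrow.c.lo\uparrow.li\downarrow.lo\downarrow.\mathrm{LEFT}$
$\mathrm{RIGHT} = c.ro\uparrow.ri\uparrow.ro\downarrow.ri\downarrow.\mathrm{RIGHT}$
$\mathrm{FIFO} = (\mathrm{LEFT} \mid \mathrm{RIGHT}) \setminus \{c\}$.    (1)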
The specification in (1) consists of two handshake processes, LEFT and RIGHT. The $c$ event synchronizes the two processes so that $ri$ must go low and $li$ must rise before both processes may proceed. This process-based specification is equivalent to the Petri-net of Fig. 7.
The SI circuit in Fig. 8 was synthesized using Petrify [2] and is a hazard-free implementation of (1).
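The synchronization described for the FIFO specification can be explored with a small product-state search. The sketch below models LEFT and RIGHT as cyclic action lists, lets them interleave freely, and allows the restricted event `c` only as a joint internal step. It is an illustration under assumptions: the `+`/`-` suffixes, the function names, and the use of one shared label `c` for both sides of the handshake are not from the paper.

```python
from collections import deque

LEFT = ["li+", "c", "lo+", "li-", "lo-"]
RIGHT = ["c", "ro+", "ri+", "ro-", "ri-"]


def moves(state):
    """Enabled (label, next_state) pairs; `c` may only fire jointly as tau."""
    i, j = state
    out = []
    if LEFT[i] == "c" and RIGHT[j] == "c":
        out.append(("tau", ((i + 1) % len(LEFT), (j + 1) % len(RIGHT))))
    if LEFT[i] != "c":
        out.append((LEFT[i], ((i + 1) % len(LEFT), j)))
    if RIGHT[j] != "c":
        out.append((RIGHT[j], (i, (j + 1) % len(RIGHT))))
    return out


def explore(start=(0, 0)):
    """Breadth-first search; returns reachable states and states where tau fires."""
    seen, queue, sync_states = {start}, deque([start]), set()
    while queue:
        state = queue.popleft()
        for label, target in moves(state):
            if label == "tau":
                sync_states.add(state)
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen, sync_states


states, sync_states = explore()
print(len(states), sync_states)  # 25 {(1, 0)}
```

The only state where the internal synchronization fires is the one reached after `li+` with RIGHT back at its start, that is, after `ri-`, which matches the statement that `ri` must go low and `li` must rise before both processes proceed.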
B. Burst-Mode FIFO Cell
The SI FIFO pays a considerable delay penalty to achieve speed independence. The trace $li\uparrow, y_1\uparrow, lo\uparrow, y_2\uparrow, ro\uparrow$ shows that $li\uparrow$ produces $lo\uparrow$ after two complex gate and inverter delays and $ro\uparrow$ after four. Perhaps the performance can be improved if the circuit can ensure that concurrent outputs are generated faster than they can be acknowledged by the environment. This assumption can be formulated as follows:
RTA8: $lo\uparrow \prec ri\uparrow$
RTA9: $ro\uparrow \prec li\downarrow$.
A new specification is generated by adding these two relative timing assumptions to (1)
$\mathrm{FIFO} \wedge \mathrm{RTA8} \wedge \mathrm{RTA9}$    (2)
where FIFO is the specification from (1). This is equivalent to the Petri-net of Fig. 9, where the dashed arrows represent relative timing constraints.

Fig. 7. FIFO specification Petri-net.
Fig. 8. Speed-independent FIFO cell (SI).
Fig. 9. FIFO specification Petri-net, with RT assumptions RTA8 and RTA9 represented as dashed arcs.
Fig. 10. Relative timed burst-mode FIFO (RT-BM).
Note that the two constraints RTA8 and RTA9 are in a form where outputs precede inputs and these outputs are concurrently enabled from the same pair of inputs. This is a burst-mode constraint where the input burst is $\{li\uparrow, ri\downarrow\}$ and the output burst is $\{lo\uparrow, ro\uparrow\}$. This burst-mode timing assumes that the variance in the generation of the concurrent outputs is always less than the response time of the environment.2
The RT-BM circuit of Fig. 10 is derived in [20] using the new RT synthesis capabilities of Petrify and implements (2). (The C-element here is synthesized as an OR gate in [20].) RT verification by Analyze extracts the timing in the physical circuit and creates additional orderings that must hold for the circuit to operate correctly
RTC10: $x\downarrow \prec lo\downarrow$
RTC11: $x\downarrow \prec ro\downarrow$.
2Also applying burst-mode constraints on input set $\{li\downarrow, ri\uparrow\}$ results in a C-element, the micropipelines implementation.
Fig. 11. Petri-net for circuit of Fig. 10 and constraints RTC10-RTC11.
These constraints, as well as the state variable $x$, are shown graphically in Fig. 11. The burst-mode implementation achieves a 2.6× average speedup over the SI circuit. Constraints RTC10-RTC11 apply only to the physical implementation and must be validated given actual circuit delays.
C. Right Before Left
Assume that we connect the circuit of (2) into a ring with a single token. The token will always arrive at an idle cell due to circuit delays if the ring is sufficiently large. Hence the handshake in process RIGHT will always complete before a new handshake in process LEFT. The SI or RT-BM circuits can safely be used in a large ring. However, the global timing of RTA12 can improve the circuit in terms of power, performance, area, and testability
RTA12: $ri\downarrow \prec li\uparrow$.
This assumption can be graphically represented as shown in Fig. 12. The arcs from $ri\downarrow$ to $ro\uparrow$ and $lo\uparrow$ are now redundant and have been removed from the figure.
Fig. 12. Net representing addition of RT assumption $ri\downarrow \prec li\uparrow$.
Fig. 14. Aggressive relative timed FIFOs.
Fig. 15. Shuffled aggressive relative timed FIFO cells.
Fig. 13. Aggressive relative timed FIFO (RT-Agr).
The dashed arcs are not causal arcs; $ri$ must go low before $li$ can rise and $ri$ cannot delay $li$. This represents a major change in the operation of the circuit; the LEFT process is no longer synchronized directly with the RIGHT process except through system timing. The design must guarantee that the token appears on the dashed arc before $li\uparrow$.
The circuit in Fig. 13 can be synthesized with 3D and Petrify from (2) adding assumption RTA12. The rising edge of signal $li$ must be delayed sufficiently through $lo$ and the buffer to ensure that the domino AND gate is not disabled before it is fully set. This results in a number of RT constraints on races in the circuit that can be derived as was done for RTC4-RTC7 in the SC circuit. This circuit shows 15% and 3× improvement in average case performance over the RT-BM and SI circuits, respectively, and energy is also improved by factors of 26% and 1.9×.
D. Pulse-Mode FIFO Cell
RTA12 now constrains the specification sufficiently to derive
a pulse-mode circuit. Through transitivity, $ro\downarrow$ must precede
$li\uparrow$. We can use this weaker constraint to discard ri, the
backward handshake signal, altogether. We show how this can be
accomplished through transformations on the circuit of Fig. 13.
Three elements of the ring are shown in Fig. 14. Observe
that the lo signal is nothing more than a delayed version of
the li signal. Shuffling the lo devices and bubbles results in
the circuit of Fig. 15, which has only forward-moving signals
without any intercellular feedback. The shuffling that removes
acknowledgment is directly based on RTA12, which dissociates the
LEFT process from the RIGHT. This shuffling turns output lo
and input ri into local signals.
Note that signal $\overline{li}$ in Fig. 15 is just li inverted. A transition
$li\uparrow$ creates a short period when both li and $\overline{li}$ are high, which
will set the output of the domino AND gate. The length of time for
which both inputs to the domino AND gate are high depends on the
delay in the $\overline{li}$ path. This signal pair can be combined into a single
wire li if the signal on this wire operates as a pulse. The final
circuit derivation can be seen in Fig. 16.
Fig. 16. Relative timed pulse-mode FIFO (pulse).
Fig. 17. Four cycle and pulse handshake protocol constraints.
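The following specification removes the direct handshake
signals lo and ri of (1) and adds RTA12:
$$
\begin{aligned}
\mathrm{LEFTP} &= li\uparrow . c . li\downarrow . \mathrm{LEFTP} \\
\mathrm{RIGHTP} &= c . ro\uparrow . ro\downarrow . \mathrm{RIGHTP} \\
\mathrm{PULSE} &= (\mathrm{LEFTP} \mid \mathrm{RIGHTP}) \setminus \{c\} \;\wedge\; ro\downarrow \prec li\uparrow
\end{aligned}
\tag{3}
$$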
Designing reliable pulse-mode circuits is very difficult [36].
We can observe some of the constraints of pulse circuits by
understanding how we have derived the pulse-mode circuit in
this example. Fig. 17 shows a four-phase request-acknowledge
handshake. Constraints 1-4 are causal with speed-independent
signaling. By removing the ack signal (lo and ri in Fig. 14),
we are left with only the request signal, which requires constraints
2p and 4p. These constraints contain both minimum and maximum
metric bounds. However, the actual requirements for the
size of these bounds can be represented with relative timing arcs
between the inputs and outputs of a pulse-mode circuit (li and
ro in Fig. 16). Interestingly, these arcs correspond to a protocol
very similar to the standard request-acknowledge handshaking.
The pulse on li of Fig. 16 causes the output pulse ro, as
required by (3). If we map req to li and ack to ro in Fig. 17, we
see that arc 1 is causal. However, this circuit can fail if the pulse
is so short that the ro (ack) pulse does not occur. We can
therefore impose an RT constraint that requires $ro\uparrow$ ($ack\uparrow$) before
$li\downarrow$ ($req\downarrow$). This makes arc 2 in Fig. 17 an RT constraint,
and slightly restricts the specification. (It may be possible to
avoid restricting the specification if an internal signal toggles,
which ensures that the domino gate has changed state.) The circuit
will also fail if the li (req) pulse is too long. If $ro\downarrow$ ($ack\downarrow$)
and the transition on y have occurred before $li\downarrow$ ($req\downarrow$), then
an additional pulse on ro might be generated. Therefore, arc 3 in
Fig. 17 is a necessary RT constraint for the circuit to work.
Finally, arc 4 is assumed to hold from RTA12, which drove this
example. We therefore have a system of causal and relative timing
relations that must hold in the pulse-mode circuit and that
directly mimic a four-phase handshake.

TABLE III
Comparison of FIFO Implementations. All Delays Are in Terms of SI
Circuit Worst Case Delay, Energy in Terms of SI Circuit Energy

Circuit        Worst-case  Average  Energy  Area  Testability
               delay       delay
SI       Yes   1.00        0.67     1.00    42    88%   79%   n/a
RT-BM    No    0.34        0.26     0.81    32    80%   77%   n/a
RT-Agr   No    0.34        0.22     0.53    18    n/a   100%  n/a
Pulse    No    0.22        0.22     0.32    15    n/a   n/a   100%

Fig. 18. SI tag unit. Assumes TAGIN (ti) handshakes are mutex.
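The four arcs discussed for the pulse-mode cell can be checked mechanically on a recorded trace. The following is a minimal sketch, assuming event times for one li/ro pulse cycle are available from simulation; the function name, the event labels (`li+`, `li-`, `ro+`, `ro-`), and the sample times are hypothetical and not part of the original design flow.

```python
from typing import Dict, List


def check_pulse_cycle(t: Dict[str, float], next_li_rise: float) -> List[str]:
    """Check one li/ro pulse cycle against arcs 1-4 of the four-phase pattern.

    t maps 'li+', 'li-', 'ro+', 'ro-' to event times in arbitrary units.
    next_li_rise is the time of the following li+ event.
    Returns a list of violated arcs (empty when all orderings hold).
    """
    violations = []
    if not t["li+"] < t["ro+"]:
        violations.append("arc 1: li+ must precede ro+ (causal)")
    if not t["ro+"] < t["li-"]:
        violations.append("arc 2: ro+ must precede li- (input pulse too short)")
    if not t["li-"] < t["ro-"]:
        violations.append("arc 3: li- must precede ro- (input pulse too long)")
    if not t["ro-"] < next_li_rise:
        violations.append("arc 4: ro- must precede the next li+")
    return violations


good = {"li+": 0.0, "ro+": 2.0, "li-": 3.0, "ro-": 5.0}
too_long = {"li+": 0.0, "ro+": 2.0, "li-": 6.0, "ro-": 5.0}
print(check_pulse_cycle(good, next_li_rise=8.0))
print(check_pulse_cycle(too_long, next_li_rise=8.0))
```

A strict ordering test like this ignores the slack each arc needs in practice; a real check would compare against a minimum separation rather than zero.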
E. Ring Summary
Some consequences of evolving a simple FIFO-like controller
from a speed-independent to a pulse-mode circuit are
summarized in Table III. The different circuits are
characterized in terms of performance, power, area, and testability.
The worst-case latency of the SI circuit is from three to five
times longer than that of the circuits that use timing. The SI
circuit is not fully testable, and the testability degrades as the
circuit is placed in an environment where concurrency is restricted.
The more aggressive timing assumptions tend to increase
the performance of the circuits, reduce the area and power,
and increase functional testability. Note that the bulk of the
improvement in performance has been achieved with the simple
burst-mode transformation; simple timing assumptions can
often have significant impact on the quality of the circuit. The
additional savings gained by going to pulse mode are much
less pronounced, except that the variation between worst-case
and average delay is eliminated. Indeed, the “aggressive” RT
controller may already be considered a pulse-mode circuit. Power
is improved for each transformation, as the pulse circuit shows a
40% reduction over RT-Agr. We feel that functional testability is
increased using relative timing because many of the redundant
coverings are removed when the circuits are optimized for time.
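As a quick check, the quoted ratios can be recomputed from the Table III entries. This assumes the three numeric columns after the Yes/No column are worst-case delay, average delay, and energy, which is an inference from the table caption and the surrounding text rather than from a surviving header:

$$
\begin{aligned}
\text{worst-case delay, SI vs. timed circuits:}\quad & 1.00/0.34 \approx 2.9, \qquad 1.00/0.22 \approx 4.5 \\
\text{average delay, SI vs. RT-BM:}\quad & 0.67/0.26 \approx 2.6 \\
\text{average delay, SI vs. RT-Agr:}\quad & 0.67/0.22 \approx 3.0 \\
\text{energy, pulse vs. RT-Agr:}\quad & 1 - 0.32/0.53 \approx 0.40
\end{aligned}
$$

These agree with the "three to five times" latency range, the 2.6× burst-mode speedup, the 3× aggressive-RT speedup, and the 40% power reduction of the pulse circuit.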
VI. Tag Unit Example
The FIFO ring is a simplified example used for illustration.
Typically, such an application would have synchronizations
coming from multiple paths. The tag unit example from RAPPID
[1] shows how relative timing can be employed to generate
extremely high-performance pulse-mode implementations.
Decoding of variable-length instructions is inherently a serial
process, since the length of any instruction directly depends on
the lengths of all previous instructions since the last branch. The
performance of decoding variable-length instructions directly
depends on how fast this serial process operates [1]. A critical
component in RAPPID is the tag unit. The tagging control
signals interconnect the tag units to form a 4 × 16 torus,
synchronizing the serial ordering of instructions by passing a tag
along the toroidal rings.
Fig. 18 shows a single speed-independent tag unit. An input
tag arrives on at most one of the inputs ti1-ti7. The tag is
synchronized with irdy and steered to one of to1-to7 depending
on instruction length l1-l7. A bufreq, irdyack, and the
corresponding tia are also issued concurrently. The four-input C4
allows four processes to complete their four-cycle handshake
concurrently and begin a new transfer when all interfaces are
synchronized. The three behaviors in the boxes are specified as
follows:
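$$
\begin{aligned}
\mathrm{PA} &= r\uparrow . sr\uparrow . sa\uparrow . (sr\downarrow . sa\downarrow \mid a\uparrow . r\downarrow) . a\downarrow . \mathrm{PA} \\
\mathrm{PB} &= sr\uparrow . sa\uparrow . (sr\downarrow . sa\downarrow \mid r\uparrow . a\uparrow) . r\downarrow . a\downarrow . \mathrm{PB} \\
\mathrm{C4} &= (go0 \mid go1 \mid go2 \mid go3) . sa . \mathrm{C4}
\end{aligned}
$$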
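One way to read the PA specification is to enumerate the event orders it permits in a single cycle: a fixed prefix (r rises, sr rises, sa rises), then the two branches (sr falls, sa falls) and (a rises, r falls) running concurrently, then a falls. The sketch below does this by brute force. It is only an illustration: it treats every action as an independent event and ignores the CCS synchronization between complementary actions, and all function names are hypothetical.

```python
from itertools import combinations
from typing import List, Tuple

Trace = Tuple[str, ...]


def interleavings(a: Trace, b: Trace) -> List[Trace]:
    """All merges of sequences a and b that keep each one's internal order."""
    n = len(a) + len(b)
    merged = []
    for positions in combinations(range(n), len(a)):
        slots_for_a = set(positions)
        ia, ib = iter(a), iter(b)
        merged.append(tuple(next(ia) if i in slots_for_a else next(ib)
                            for i in range(n)))
    return merged


def pa_cycle_traces() -> List[Trace]:
    """Event orders of one PA cycle: prefix, two concurrent branches, suffix."""
    prefix = ("r+", "sr+", "sa+")
    sync_reset = ("sr-", "sa-")   # synchronizer returns to zero
    handshake = ("a+", "r-")      # acknowledge and request withdrawal
    suffix = ("a-",)
    return [prefix + middle + suffix
            for middle in interleavings(sync_reset, handshake)]


for trace in pa_cycle_traces():
    print(" . ".join(trace))
```

The same enumeration applies to PB by moving the r and a events: there the prefix is (sr rises, sa rises) and the handshake branch is (r rises, a rises), followed by r falls and a falls.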
The two passive PA processes synchronize the four-phase
handshake after r requests are received, while the two PB
processes are active and synchronize before handshaking.
Therefore, when the ti and irdy requests arrive and the bufreq and
to cycles have completed, the ti and irdy signals will be
acknowledged and the to and bufreq cycles will start. This is
accomplished in the specification by renaming the signals and
composing the processes as follows:
$$
\begin{aligned}
\mathrm{IRDY} &= \mathrm{PA}[irdy/r,\ irdyack/a,\ go0/sr] \\
\mathrm{TAGIN} &= \mathrm{PA}[ti/r,\ tia/a,\ go1/sr] \\
\mathrm{TAGOUT} &= \mathrm{PB}[to/r,\ toa/a,\ go3/sr] \\
\mathrm{BUFREQ} &= \mathrm{PB}[bufreq/r,\ bufack/a,\ go2/sr] \\
\mathrm{TAGUNIT} &= (\mathrm{IRDY} \mid \mathrm{TAGIN} \mid \mathrm{TAGOUT} \mid \mathrm{BUFREQ} \mid \mathrm{C4}) \setminus \{go0, go1, go2, go3, sa\}
\end{aligned}
\tag{4}
$$
The SI implementation of these processes using ATACS is
shown in Fig. 19. Processes PA and PB result in very efficient
implementations. However, the large OR gates, the C-elements, and
the necessity of passing through three state machines from the
input to the output of the tag path create significant latency in
this implementation.
The circuit used in RAPPID is shown in Fig. 20. This efficient
circuit is very similar to the pulse FIFO (Fig. 16) derived in
Section V. The extra gates are used to steer the tag paths (ti to
to) based on the instruction length and to synchronize with the
instruction issue buffers. The backward handshake signals in the
tag path have been removed, and the forward-going signals are
pulses. The request and acknowledge protocols on the irdy and
bufreq paths are combinations of four-phase and pulse-mode
signaling, with irdyack and bufreq being pulses.
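$$
\begin{array}{lll}
\text{2p} & \text{RTA13}: & \{bufreq\uparrow,\ irdyack\uparrow\} \prec to\downarrow \\
\text{2p} & \text{RTA14}: & \{to\uparrow,\ irdyack\uparrow\} \prec bufreq\downarrow \\
\text{2p} & \text{RTA15}: & \{to\uparrow,\ bufreq\uparrow\} \prec irdyack\downarrow \\
\text{3} & \text{RTA16}: & ti\downarrow \prec to\downarrow \\
\text{3} & \text{RTA17}: & ti\downarrow \prec bufreq\downarrow \\
\text{3} & \text{RTA18}: & ti\downarrow \prec irdyack\downarrow
\end{array}
$$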
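$$
\begin{array}{lll}
\text{4} & \text{RTA19}: & \{bufreq\downarrow,\ bufack\uparrow,\ irdyack\downarrow,\ irdy\downarrow\} \prec ti\uparrow \\
\text{4} & \text{RTA20}: & \{to\downarrow,\ bufreq\downarrow,\ bufack\uparrow,\ ba\downarrow\} \prec irdy\uparrow \\
 & \text{RTA21}: & irdyack\downarrow \prec irdy\downarrow \\
 & \text{RTA22}: & bufreq\downarrow \prec bufack\downarrow
\end{array}
$$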
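$$
\begin{aligned}
\mathrm{TAGS} &= b1 . ti\uparrow . c1 . (ti\downarrow \mid c2 . to\uparrow . to\downarrow) . \mathrm{TAGS} \\
\mathrm{BUF} &= c1 . c2 . bufreq\uparrow . (bufreq\downarrow \mid bufack\uparrow . bufack\downarrow) . \mathrm{BUF} \\
\mathrm{IRDY} &= irdy\uparrow . (b2 . c2 . irdyack\uparrow . (irdyack\downarrow \mid irdy\downarrow) . \mathrm{IRDY} + nott . irdy\downarrow . nott . \mathrm{IRDY}) \\
\mathrm{MUTEX} &= (b1 . b2 + nott . nott) . \mathrm{MUTEX} \\
\mathrm{TAG} &= (\mathrm{TAGS} \mid \mathrm{BUF} \mid \mathrm{IRDY} \mid \mathrm{MUTEX}) \setminus \{c1, c2, b1, b2, nott\} \;\wedge\; \text{RTA13-RTA22}
\end{aligned}
\tag{5}
$$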
The specification for the RAPPID tag circuitry is shown in (5).
The processes are behavioral pulse-based specifications without
timing. For example, the lowering edge of the pulse signal, $ti\downarrow$,
and the output pulse in process TAGS are concurrent. The timing
assumptions necessary to create a failure-free circuit can be
classified by type according to Fig. 17. Type 4 assumptions
on the ti and to signals are encoded into the specification
since the TAGIN and TAGOUT processes have been combined.
The synchronizations c1 and c2 encode causal transitions of
type 1. RTA13-RTA15 encode type 2p transitions: minimum
pulse-width constraints on to, bufreq, and irdyack.
Assumptions RTA16-RTA18 are type 3, ensuring that the input pulse
lowers before the output pulse. RTA19 and RTA20 are type
4 assumptions, which require the logic to stabilize before the
next TAGIN arrives. Assumptions RTA21 and RTA22 simply
constrain the ordering of the pulsed handshake signals. (Such
constraints could easily have been placed in the specification,
but have been included as RT assumptions because they are
guaranteed by timing rather than by a causal relation.)
Fig. 19. Speed-independent tag unit circuits: (a) PA, (b) PB, (c) C4.
Fig. 20. Simplified RAPPID tag unit.
Equation (6) shows the complete set of RT constraints placed
on the circuit and system for the simplified RAPPID
implementation to be valid. These constraints were generated and
verified through Analyze [27]. RTC23 and RTC24 are the type 2
constraints, RTC25-RTC27 are type 3 (the same as RTA16-RTA18
in the specification), RTC28-RTC31 are the type 4 constraints, and
RTC32-RTC33 are the type 4p constraints. Note that a single delay
path constraint may include several RT constraints as we have
used them here.
2 R TC 23:to t a g lo c a l  J.
2 R T C 24 :{ irdyack  | , t o  | , t l  j}  -< rd y  j
3 R T C 2 5 :ti to  |
3 R T C 2 6 :ti b r  j
3 R T C 2 7 :ti i rd y a c k  J.
4 R TC 28:rdy t a g lo c a l  |
4 R TC 29:rdy ba T
4 R T C 3 0 :{ ta g lo c a l L t l  t}  -«< t i  t
4 R T C 3 1 :ta g lo c a l J.-< rd y  |
4p RTC32:{ba t 5ba  [ }  -< i r d y  |
4p R T C 3 3 :ta g lo ca l J.-< t l  |  . (6)
TABLE IV
Comparison of RAPPID Tag Unit With the SI Version. Cycle Time
of SI Circuit Is Base Case for Delays. Area Is the Number of
Transistors; Testability Refers to the Complete RAPPID Tag
Unit and Steering Logic

Circuit   Latency   Cycle Time   Cycle Energy   Area (# Trans.)   RAPPID Testability
SI        0.53      1.00         1.00           297               n/a
RAPPID    0.21      0.39         0.54           97                98.6%
While the circuit of Fig. 20 may be easier to verify using the
metric timing of ATACS, we feel that explicitly attaching many,
if not all, of the timing constraints as RT predicates makes the
specification and circuit timing requirements more perspicuous.
Each interface has a simple behavioral definition, which is
refined by timing assumptions as predicates. Incorporating the
assumptions into the specification removes much of the clarity
of the resulting synchronizations and orderings. Representing
the complete behavior constraints or timing constraints as
a Petri-net, as was shown in Section V, can be elucidating
for understanding small examples, but can be confusing and
impractical for larger, real-world examples such as the tag
unit in RAPPID. This is particularly the case for pulse-based
implementations where the set of timing constraints can be
quite large.
Table IV compares the two implementations. The RT circuit
has a 3.1× area, 1.9× power, and 2.5× latency and throughput
improvement over the speed-independent circuit. Since this
circuitry is in the critical path of the RAPPID length decoder, the
improvements in this example directly resulted in improvements
to RAPPID [1]. The area impact on RAPPID from the RT
circuit is arguably much higher than the transistor count
comparison since this circuit is wire-limited and can be scaled. If
slow parts are used, higher scaling factors must be employed to
meet the target performance. If the slower SI tag unit had been used
in RAPPID, the area would have ballooned significantly to meet
the performance goals. The area savings in terms of the 50%
reduction in wire count from removing the backward handshake is
also significant. Since RAPPID tagging uses point-to-point
signaling connected in a torus, removing the backward
acknowledgment path resulted in a savings of 14 wires per tag unit.
This reduced the network bisection of the tag logic by a total of
224 tag wires.
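The improvement factors and the wire count can be recomputed from the Table IV entries and the figures quoted in the text. The sketch below is only an arithmetic check; the dictionary keys and variable names are hypothetical, and the reading of 224/14 as a count of tag units across the bisection is an assumption rather than something the text states.

```python
# Values copied from Table IV (delays and energy normalized to the SI circuit).
si = {"latency": 0.53, "cycle_time": 1.00, "energy": 1.00, "transistors": 297}
rappid = {"latency": 0.21, "cycle_time": 0.39, "energy": 0.54, "transistors": 97}


def improvement(metric: str) -> float:
    """Ratio of the SI value to the RAPPID (RT) value for one metric."""
    return si[metric] / rappid[metric]


for metric in si:
    print(f"{metric}: {improvement(metric):.2f}x")

# Wire savings quoted in the text.
wires_saved_per_unit = 14
bisection_reduction = 224
# Assumed interpretation: how many tag units' worth of saved wires
# make up the bisection reduction.
units_across_bisection = bisection_reduction // wires_saved_per_unit
print(units_across_bisection)
```

The quotient matches the longer dimension of the 4 × 16 torus, which is consistent with, though not proof of, the bisection cutting one ring of tag units.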
VII. Conclusion
The development of circuits requires correct operation in two
domains: behavioral and temporal. Our experiments indicate
that the design, synthesis, and verification of circuits can be
significantly enhanced if both temporal and behavioral domains
can be explicitly represented and merged. Relative timing is a
means of combining behavioral and temporal information. The
state space of the untimed circuit is reduced by removing
unreachable relative signal orderings that are induced through time
constraints.
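The state-space reduction can be seen on a toy example. The sketch below enumerates the total orders of a handful of events that respect a set of causal precedences, then adds one relative timing precedence and counts what remains. The four events and their causal pairs are a hypothetical fragment chosen for illustration, not the full FIFO specification, and the brute-force enumeration is only practical for very small event sets.

```python
from itertools import permutations
from typing import Iterable, List, Tuple

Order = Tuple[str, ...]
Pair = Tuple[str, str]


def orderings(events: Iterable[str], precedes: Iterable[Pair]) -> List[Order]:
    """All total orders of events consistent with the (before, after) pairs."""
    pairs = list(precedes)
    valid = []
    for order in permutations(events):
        position = {event: i for i, event in enumerate(order)}
        if all(position[a] < position[b] for a, b in pairs):
            valid.append(order)
    return valid


events = ["ri+", "ri-", "li+", "lo+"]
causal = [("ri+", "ri-"), ("li+", "lo+")]   # untimed behavior (hypothetical)
rt = [("ri-", "li+")]                        # relative timing: ri- before li+

untimed = orderings(events, causal)
timed = orderings(events, causal + rt)
print(len(untimed), len(timed))
```

Every ordering removed by the timing pair is one the implementation no longer has to handle, which is where the logic simplification in the synthesized circuits comes from.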
Relative timing is a useful way of reasoning about designs.
The waveforms in databooks are presented in such a way as to
highlight the relation between signals and transitions. One can
use relative timing to architect systems, as well as synthesize
controllers, and verify the correctness of systems. Synthesis and
verification algorithms can be designed to directly support this
concept, where time is represented as a relationship similar to a
behavioral or causal relation.
RT can be applied as aggressively or conservatively as
desired. Races due to the environment in burst-mode
implementations, and races due to inverter delays in
speed-independent implementations, can be discovered and
explicitly listed with the circuit. Indeed, relative timing is a
superset of asynchronous methodologies such as DI, SI, and burst
mode.
Relative timing does not preclude metric or absolute timing.
Metric timing must eventually be applied in the implementation
against the RT constraints to prove that they hold. Further, many
of the RT constraints require a certain amount of slack, or setup
and hold times, in the precedence relations. The robustness and
reliability of the circuits can depend directly on the amount of
slack on the RT constraints.
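Checking an RT constraint against metric delays can be sketched as a comparison of path delay bounds. The example below assumes each side of a constraint "first before second" is described by a (minimum, maximum) delay pair measured from a common starting event; the function names, the margin parameter, and the numbers are hypothetical and do not describe any particular tool.

```python
from typing import Tuple

Bounds = Tuple[float, float]  # (min_delay, max_delay) from a common start event


def rt_slack(first: Bounds, second: Bounds) -> float:
    """Worst-case slack of 'first precedes second'.

    The constraint is tightest when the first event is as late as possible
    and the second event is as early as possible.
    """
    latest_first = first[1]
    earliest_second = second[0]
    return earliest_second - latest_first


def constraint_holds(first: Bounds, second: Bounds, margin: float = 0.0) -> bool:
    """True when the worst-case slack exceeds the required margin."""
    return rt_slack(first, second) > margin


print(rt_slack((1.0, 2.0), (3.0, 4.0)))
print(constraint_holds((1.0, 2.0), (3.0, 4.0), margin=0.5))
print(constraint_holds((1.0, 3.5), (3.0, 4.0)))
```

The margin plays the role of the setup or hold time mentioned above: a constraint with positive but small slack passes a zero-margin check while still being fragile under delay variation.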
The quality of the RAPPID results in terms of throughput,
power, area, testability, and latency was largely due to the timing
employed in the circuits [1]. This benefit is shown through
applying relative timing to the examples in this paper, and in the
early tools that have formalized some of these translations.
Acknowledgment
The authors are grateful for the helpful and constructive
comments from the referees. H. Hulgaard and S. Burns participated
in timing verifications. J. Cortadella and M. Kishinevsky were
the first to introduce automatic RT into the CAD tool Petrify.
P. Beerel and H. Kim have been key contributors to RT
verification and optimization.
REFERENCES
[1] K. Stevens, S. Rotem, R. Ginosar, P. Beerel, C. Myers, K. Yun, R. Kol, C. 
Dike, and M. Roncken, “An asynchronous instruction length decoder,” 
IEEE J. Solid-State Circuits, vol. 36, pp. 217-228, Feb. 2001.
[2] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and A. 
Yakovlev, “Petrify: A tool for manipulating concurrent specifications 
and synthesis of asynchronous controllers,” IEICE Trans. Inform. Syst., 
vol. E80-D, no. 3, pp. 315-325, 1997.
[3] S. M. Nowick, “Automatic synthesis of burst-mode asynchronous
controllers,” Ph.D. dissertation, Dept. of Computer Science, Stanford Univ.,
1993.
[4] K. Y. Yun, “Synthesis of asynchronous controllers for heterogeneous 
systems,” Ph.D. dissertation, Stanford Univ., 1994.
[5] C. J. Myers, “Computer-aided synthesis and verification of gate-level
timed circuits,” Ph.D. dissertation, Dept. of Electrical Engineering,
Stanford Univ., 1995.
[6] K. J. Nowka and T. Galambos, “Circuit design techniques for a
gigahertz integer microprocessor,” in 1998 IEEE Int. Conf. Computer
Design: VLSI in Computers & Processors (ICCD98), Oct. 1998, pp. 11-16.
[7] D. Sager, G. Hinton, M. Upton, T. Chappell, T. D. Fletcher, S. Samaan,
and R. Murray, “A 0.18 µm CMOS IA32 microprocessor with a 4 GHz
integer execution unit,” in Int. Solid State Circuits Conf., Feb. 2001, pp.
324-325.
[8] S. Schuster, W. Reohr, P. Cook, D. Heidel, M. Immediato, and K.
Jenkins, “Asynchronous interlocked pipelined CMOS circuits operating
at 3.3-4.5 GHz,” in Int. Solid State Circuits Conf., 2000, pp. 292-293.
[9] C. L. Seitz, “System timing,” in Introduction to VLSI Systems, C. A. 
Mead and L. A. Conway, Eds. Reading, MA: Addison-Wesley, 1980, 
ch. 7.
[10] D. E. Muller and W. S. Bartky, “A theory of asynchronous circuits,” in 
Proc. Int. Symp. Theory of Switching, Apr. 1959, pp. 204-243.
[11] S. Hauck, “Asynchronous design methodologies: An overview,” Proc. 
IEEE, vol. 83, no. 1, pp. 69-93, Jan. 1995.
[12] S. Chakraborty, “Polynomial-time techniques for approximate timing 
analysis of asynchronous systems,” Ph.D. dissertation, Stanford Univ., 
1998.
[13] P. Vanbekbergen, G. Goossens, F. Catthoor, and H. J. De Man,
“Optimized synthesis of asynchronous control circuits from graph-theoretic
specifications,” IEEE Trans. Computer-Aided Design, vol. 11, pp.
1426-1438, Nov. 1992.
[14] C. J. Myers, T. G. Rokicki, and T. H.-Y. Meng, “POSET timing and its 
application to the synthesis and verification of gate-level timed circuits,” 
IEEE Trans. Computer-Aided Design, vol. 18, pp. 769-786, June 1999.
[15] W. Belluomini and C. J. Myers, “Timed circuit verification using TEL 
structures,” IEEE Trans. Computer-Aided Design, vol. 20, Jan. 2001.
[16] R. Alur and D. L. Dill, “A theory of timed automata,” Theoret. Comput. 
Sci., vol. 126, no. 2, pp. 183-235, 1994.
[17] H. Hulgaard, “Timing analysis and verification of timed asynchronous
circuits,” Ph.D. dissertation, Dept. of Computer Science, Univ. of
Washington, 1995.
[18] R. Negulescu and A. Peeters, “Verification of speed-dependences in 
single-rail handshake circuits,” in Proc. Int. Symp. Advanced Research 
in Asynchronous Circuits and Systems, 1998, pp. 159-170.
[19] S. Chakraborty, K. Y. Yun, and D. L. Dill, “Practical timing analysis of 
asynchronous systems using time separation of events,” in Proc. IEEE 
Custom Integrated Circuits Conf., May 1998.
[20] J. Cortadella, M. Kishinevsky, S. M. Burns, A. Kondratyev, L. Lavagno, 
K. S. Stevens, A. Taubin, and A. Yakovlev, “Lazy transition systems and 
asynchronous circuit synthesis with relative timing assumptions,” IEEE 
Trans. Computer-Aided Design, vol. 21, pp. 109-130, Feb. 2002.
[21] P. Day and J. V. Woods, “Investigation into micropipeline latch design 
styles,” IEEE Trans. VLSI Syst., vol. 3, pp. 264-272, June 1995.
[22] S. B. Furber and P. Day, “Four-phase micropipeline latch control
circuits,” IEEE Trans. VLSI Syst., vol. 4, pp. 247-253, June 1996.
[23] S. S. Appleton, S. V. Morton, and M. J. Liebelt, “Two-phase
asynchronous pipeline control,” in Proc. Int. Symp. Advanced Research in
Asynchronous Circuits and Systems, Apr. 1997, pp. 12-21.
[24] C. Myers, “Timed circuits: A new paradigm for high-speed design,” in 
Proc. Asia and South Pacific Design Automation Conf., Feb. 2001.
[25] H. Kim, P. A. Beerel, and K. S. Stevens, “Relative timing based
verification of timed circuits and systems,” in Proc. Int. Symp. Advanced
Research in Asynchronous Circuits and Systems, Apr. 2002.
[26] A. J. Martin, “Programming in VLSI: From communicating processes 
to delay-insensitive circuits,” in Developments in Concurrency and 
Communication. ser. UT Year of Programming Series, C. A. R. Hoare, 
Ed. Reading, MA: Addison-Wesley, 1990, pp. 1-64.
[27] K. S. Stevens, “Practical verification and synthesis of low latency
asynchronous systems,” Ph.D. dissertation, Univ. of Calgary, Calgary, Alta.,
Canada, 1994.
[28] R. Milner, “Communication and concurrency,” in Computer
Science. London, U.K.: Prentice-Hall, 1989.
[29] M. Shams, J. C. Ebergen, and M. I. Elmasry, “Modeling and comparing 
CMOS implementations of the C-element,” IEEE Trans. VLSI Syst., vol.
6, pp. 563-567, Dec. 1998.
[30] S. M. Nowick and D. L. Dill, “Exact two-level minimization of
hazard-free logic with multiple-input changes,” IEEE Trans.
Computer-Aided Design, vol. 14, pp. 986-997, Aug. 1995.
[31] T.-A. Chu, “Synthesis of self-timed VLSI circuits from graph-theoretic
specifications,” Ph.D. dissertation, Massachusetts Institute of
Technology, Cambridge, 1987.
[32] T. Murata, “Petri nets: Properties, analysis and applications,” Proc. 
IEEE, vol. 77, pp. 541-580, Apr. 1989.
[33] W. S. Coates, A. L. Davis, and K. S. Stevens, “Automatic synthesis of 
fast compact self-timed control circuits,” in IFIP Working Conf. Design 
Methodologies, Apr. 1993, pp. 193-208.
[34] D. L. Dill, “ACM distinguished dissertations,” in Theory for Automatic 
Hierarchical Verification of Speed-Independent Circuits. Cambridge, 
MA: MIT Press, 1989.
[35] I. E. Sutherland, “Micropipelines,” Commun. ACM, vol. 32, no. 6, pp. 
720-738, June 1989.
[36] V. Narayanan, B. A. Chappell, and B. M. Fleischer, “Static timing anal­
ysis for self resetting circuits,” in Int. Conf. Computer-Aided Design 
(ICCAD-96), Nov. 1996, pp. 119-126.
Kenneth S. Stevens (S’83-M’84-SM’99) received the B.A. degree in biology
in 1982 and the B.S. and M.S. degrees in computer science from the
University of Utah, Salt Lake City, in 1982 and 1984, respectively. He
received the Ph.D. degree in computer science from the University of
Calgary, AB, Canada, in 1994.
From 1984 to 1991, he held research positions at the
Fairchild/Schlumberger Laboratory for AI Research, the Schlumberger Palo
Alto Research Laboratory, and Hewlett-Packard Laboratories, Palo Alto, CA.
He became an Assistant Professor at the Air Force Institute of Technology,
Dayton, OH, in 1994. Since 1996, he has been an Adjunct Professor. Since
1996, he has been with Intel Corporation’s Strategic CAD Labs, Hillsboro,
OR. His primary expertise includes asynchronous circuits, VLSI,
architecture, hardware synthesis and verification, and timing analysis. He
has received seven patents, has been the principal author of three papers
that received best paper awards, and has served on technical program
committees for conferences and workshops.
Ran Ginosar (S’79-M’82) received the B.Sc. degree in electrical engineering
and computer engineering (summa cum laude) from The Technion—Israel
Institute of Technology, Haifa, in 1978 and the Ph.D. degree in electrical
engineering and computer science from Princeton University, Princeton, NJ,
in 1982.
After working with AT&T Bell Laboratories for one year, he joined the
Faculty of The Technion in 1983. He was a Visiting Associate Professor with
the University of Utah in 1989-1990 and a Visiting Faculty Member with the
Strategic CAD Lab at Intel in 1997-1999. He is the Head of the VLSI Systems
Research Center at The Technion. His research interests include asynchronous
systems and electronic imaging.
Shai Rotem was born in Haifa, Israel, in 1954. He
received the B.Sc. degree from The Technion—Israel
Institute of Technology, Haifa, in 1980.
He has been with Intel Corporation since 1980,
in positions of VLSI design and architecture of data
communication controllers and microprocessors, as
well as CAD design and research in formal
verification and asynchronous design. He is currently a
Principal Engineer in the Mobile Processor Group’s
architecture team, responsible for future mobile
microprocessor definition.
