Relative timing by Stevens, Kenneth & Ginosar, Ran
Relative Timing
Ken Stevens¹, Ran Ginosar¹,², Shai Rotem¹
¹ Strategic CAD Labs, Intel Corporation, Hillsboro, OR
² VLSI Systems Research Center, Technion, Haifa, Israel
Abstract
Relative Timing is introduced as an informal method 
for aggressive asynchronous design. It is demonstrated on 
three example circuits (C-Element, FIFO, and RAPPID Tag 
Unit), facilitating transformations from speed-independent 
circuits to burst-mode, relative timed, and pulse-mode cir­
cuits. Relative timing enables improved performance, area, 
power and testability in all three cases.
1. Introduction
The design of RAPPID, the asynchronous instruction 
length decoder, took more than two years to complete [13]. 
Beyond investigating whether asynchronous design could 
improve performance, we also wanted to find out which de­
sign styles and circuit families are most suitable for aggres­
sive circuit design.
We started with Speed Independent (SI) and Extended 
Burst Mode (XBM) specifications. However, existing syn­
thesis tools [5, 17] yielded results that were less than sat­
isfactory for critical paths. Next, we turned to timed de­
sign and employed a metric timing synthesis tool [9]. The 
resulting circuits demonstrated improved performance but 
were still below our expectations. Therefore, we turned to 
aggressive manual design for the critical paths and man­
aged to obtain the results reported in [13]. Now we face 
the question of how our method of semi-manual design can 
be turned into an effective CAD methodology and tools.
In retrospect, one approach stands out as the most suc­
cessful method in that process. We employed Relative Tim­
ing (RT) assumptions to specify and argue about our cir­
cuits, applied certain transformations that preserved relative 
timing, and validated that the relative timing assumptions 
held in the final circuits. This approach turned out to be a 
very effective method to semi-formalize the substitution of 
aggressive pulse-mode, self-resetting circuits for the origi­
nal full-handshake speed-independent ones.
We propose that a new formal methodology and tools be 
developed to support this method. In the absence of such
CAD tools, the method is quite inefficient for the design of 
large systems. This paper presents our lessons in order to 
motivate such an effort. We start with simple, contrived 
examples that demonstrate basic principles, and move to 
a RAPPID circuit which has been improved substantially 
with relative timing.
2. Motivation and description
The design of timing in digital circuits is an extremely 
difficult challenge. The conventional clocked digital design 
methodology solves this problem by decomposing the cir­
cuit into cycle-free combinational logic (CL) stages and in­
terstage clocked latches; the clock cycle is simply tuned to 
accommodate the worst-case propagation delay in the CL 
stages. The behavior of the combinational logic can be 
specified and synthesized without considering timing. De­
lay Insensitive (DI) asynchronous circuits are analogous to 
clocked CL design in the sense that both types are indepen­
dent of time -  the behavior will be correct for arbitrary gate 
and wire delay.
High-performance circuits, both clocked and asyn­
chronous, benefit from more aggressive timing methodolo­
gies. Clocked circuits can be considerably enhanced us­
ing local self-timing [12]. Timed asynchronous circuits can 
have significantly enhanced performance, but require bet­
ter understanding and modeling of circuit performance and 
delay variation.
Metric timing requires the specification of propagation 
times or ranges thereof [16, 9]. Unfortunately metric tim­
ing analysis can explode in complexity to the extent that the 
synthesis and verification of even moderately sized timed 
circuits can become intractable [1]. Metric timing typically 
needs complete characterization of all device and environ­
ment delays to achieve improvements over unbounded de­
lay models. Complete characterization of environment de­
lays as well as estimation of the latencies of the circuits to 
be synthesized seem awkward.
An alternative to metric timing allows the designer or 
CAD algorithms to specify the effect of delays in a circuit 
in terms of assertions on relative ordering of events (e.g. a
goes high before b goes low). Our application of relative 
timing is based on the unbounded delay model already used 
by most asynchronous synthesis and verification tools. SI or 
XBM specifications are easily restricted based on designer 
specified assumptions of relative signal orderings of the en­
vironment. The circuits are then designed to meet the rel­
ative orderings, or verified that the restrictions are already 
part of the delays in the system.
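As a minimal sketch of what such an assertion means operationally, the Python below checks an ordering like "a goes high before b goes low" against a recorded event trace. The trace encoding, the name `ordering_holds`, and the per-cycle reading (the earlier event must be seen afresh before each occurrence of the later one) are illustrative assumptions, not part of the RAPPID flow or of any tool cited here.

```python
from typing import List, Tuple

# An event is (signal name, direction); "+" is a rising edge, "-" a falling edge.
Event = Tuple[str, str]


def ordering_holds(trace: List[Event], first: Event, second: Event) -> bool:
    """Check the relative ordering `first` before `second` on a trace.

    Hypothetical per-cycle interpretation: each occurrence of `second`
    must be preceded by an occurrence of `first` that has not already
    been consumed by an earlier `second`.
    """
    seen_first = False
    for event in trace:
        if event == first:
            seen_first = True
        elif event == second:
            if not seen_first:
                return False
            seen_first = False  # the next cycle needs a fresh `first`
    return True


# "a goes high before b goes low", checked over two cycles.
good_trace = [("a", "+"), ("b", "-"), ("a", "+"), ("b", "-")]
bad_trace = [("a", "+"), ("b", "-"), ("b", "-")]
```

A synthesis or verification tool would apply the same idea to the reachable state space of a specification rather than to a single trace; the trace form is used here only because it is the simplest thing that can be run.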
Many timing CAD tools and methodologies exist; asyn­
chronous design itself is a timing methodology. Order­
ing signals temporally is not novel. Metric and non-metric 
timed automata have been considered by [1, 9, 6, 11, 2, 4]. 
Component databooks include waveforms showing relative 
signal orderings. However, we do feel that the RT method­
ology used in RAPPID applies timing top-down in a novel 
way that is intuitive, flexible, creates high performance 
small low power testable circuits, and is easily supported 
by CAD.
3. RAPPID relative timing design
Once the RAPPID architecture was complete the chal­
lenge of circuit mapping began. Initial specifications were 
synthesized using full-handshake circuits. We began study­
ing the environment of many of the critical circuits to see if 
timing could be employed to reduce the number of logic lev­
els in each controller. The system architecture created en­
vironmental signal relations where the fastest arrival delays 
are large compared to the local controllers (as in the ring 
example in Sections 4.3 and 4.4). Signal orderings were 
also enforced by design. The latency of many circuits in 
RAPPID was reduced by a factor of as much as 3 through 
such timing transformations. These transformations modified
many behavioral aspects of the specifications, concurrency
in particular. However, the essential functionality of 
the controllers -  synchronization and ordering -  remained.
Most of the RT circuits in RAPPID were designed by 
hand. This effort, while time consuming, helped us bet­
ter understand timing, timed technology mapping, and what 
types of transformations appeared most beneficial. We in­
vestigated various forms of handshaking, including proto­
cols without direct handshaking. These pulse-based proto­
cols can at times significantly improve the simplicity and 
latency of asynchronous circuits.
Most of our implementations were mapped onto domino 
library cells. Domino circuits are a restricted class of gen­
eralized C-Elements where only a single term exists in the 
reset function. The combination of state-holding and low 
transition latency of the domino gates made them the best 
circuit alternative we investigated.
A key aspect to the correct operation of the silicon was 
the verification of these timed circuits. The timing verifi­
cation tool Analyze [15] was enhanced to support relative
timing verification. The verifier was also used to generate 
a complete set of RT constraints from the critical races in 
a circuit. These constraints enforce a particular resolution 
of the races that guarantees correct operation of a circuit. 
This is shown in Section 4.2.1, where the hidden timing 
assumptions of burst-mode are explicitly derived. Timing 
assumptions in this paper are labeled RTA, whereas critical 
races that are discovered through verification and must be 
ordered for the circuit to operate correctly are labeled RTC.
Some of the hand designed RT circuits were checked for 
validity through ATACS. However, the environmental and 
local path delays in the RT assumptions were typically val­
idated with SPICE simulations.
We feel that relative timing had significant impact on the 
throughput (3x improvement), latency (2x improvement), 
and area (15% bloat) over similar logic in a commercial syn­
chronous implementation. Although harder to quantify, we 
feel that relative timing was key in achieving the 95% stuck- 
at testability in RAPPID through removing redundancies 
that naturally result through fixed signal orderings induced 
by timing.
The lack of synthesis support was a serious productiv­
ity limitation once our methodology was in place. Part of 
this aspect has been successfully addressed in joint research 
with the Petrify team by creating integrated algorithms that 
support RAPPID-style automatic RT synthesis. Many of 
the key RT controllers, including the one presented in Sec­
tion 4.4, can now be directly synthesized in Petrify.
A significant weakness in RAPPID validation was tim­
ing analysis support in the back-end. Henrik Hulgaard veri­
fied the timing of the RAPPID FIFO. A relative timing flow 
that automatically generates all essential RT constraints, 
calculates the best and worst case paths necessary for the 
constraint to hold, and completes the timing analysis for 
these paths is research yet to be completed.
We encourage researchers to further develop CAD for 
RT design.
4. Examples
4.1. Notation and terminology
Table 1 shows some notations used in this paper. For 
CCS [7], ‘.’ is the sequential operator, ‘+’ is the
nondeterministic choice operator, ‘|’ is parallel composition,
and ‘\{a}’ is the restriction operator applied to signal a.
All simulations have been made using standard library 
cell device sizes driving six standard inverters as a load. 
They were simulated in SPICE using the MOSIS 0.5µ process
parameters. A more complete modeling of some of 
these circuits and parameters can be found in [14].
The circuit examples in this paper are all based on non­
clocked domino gates employing a single pMOS device.
Signal                    Description   Example
input signal              underline     input
output signal                           output
inverted (asserted low)   over-bar      z̅
rising transition         up arrow      a↑
falling transition        down arrow    b↓
Table 1. Notation conventions
Asynchronous tools such as 3D [17], ATACS [8] and Pet­
rify [5] can typically synthesize set-reset flops and the ap­
propriate functions (Figure 1(a)). We apply technology 
mapping into single-variable reset (equivalently set) func­
tions, and implement them using standard footed domino 
gates as in Figure 1(b). When the reset variable is not used 














Figure 1. (a) Set-Reset flop and functions. 
(b) Footed domino gate (symbol and circuit) 
implementing a Set-Reset flop with f_r = x̅, 
f_s = x × a × (b + c). (c) Unfooted domino gate 
implementing f_r = x, f_s = a × (b + c).
4.2. C-Element
A simple two-input generalized C-Element C = (a | b).z.C
(as defined in CCS [7]) and its CMOS implementation
are shown in Figure 2(a). Let's assume that we know
that the environment always produces transitions on a before
transitions on b, and we feel this knowledge might
simplify our circuit. This relative timing assumption is
expressed as follows:
RTA1: a ≺ b
The C-Element is reduced to a buffer: C = b.z.C using 
this assumption. If the assumption is limited to the falling
edges,
RTA2: a↓ ≺ b↓
the reset function contains only b↓, and the C-Element can 
be implemented as a footed domino gate (Figure 2(b)):
C = (a↑ | b↑).z↑.a↓.b↓.z↓.C. With a similar assumption on the 
positive edges,
RTA3: a↑ ≺ b↑
the circuit can be mapped to the domino gate in Figure 2(c) 
by inverting the inputs and employing the non-buffered z̅ 
output. Alternatively, the output can be buffered for high 
loads. A “wobbly” C-Element C = a.b.z.C + b.(a.z.C + b.C),
that is unsafe because input b may toggle and with-
Figure 2. Generalized C-Elements: (a) gC, (b) 
GC-RT for a↓ ≺ b↓, (c) for a↑ ≺ b↑
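The reduction of the C-Element to a buffer under RTA1 can be illustrated with a small behavioral model. The sketch below is an assumption-laden abstraction (zero-delay gates evaluated on a sequence of input states, with hypothetical function names), not the domino circuits of Figure 2. It shows that a C-Element and a buffer on b produce the same output whenever a always changes before b, and that they differ when b is allowed to lead.

```python
def c_element(a: int, b: int, z: int) -> int:
    """Two-input C-Element: output follows the inputs when they agree,
    otherwise it holds its previous value z."""
    return a if a == b else z


def buffer_on_b(a: int, b: int, z: int) -> int:
    """Reduced implementation under RTA1: the output simply follows b."""
    return b


def outputs(input_states, gate):
    """Evaluate a gate over a sequence of (a, b) input states, starting at z = 0."""
    z = 0
    result = []
    for a, b in input_states:
        z = gate(a, b, z)
        result.append(z)
    return result


# Environment obeying RTA1: every transition on a precedes the one on b.
rta1_trace = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
# Unconstrained environment: b rises and falls first.
b_first_trace = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
```

On `rta1_trace` both gates yield the same output sequence, which is the sense in which the timing assumption makes the state-holding behavior on input a redundant; on `b_first_trace` the buffer changes its output early.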
Let’s consider the static C-Element (SC) in Figure 3(a). 
This circuit is not speed-independent, but is safe provided
the environment is sufficiently slow. Alternatively, Pet­
rify [5] synthesizes the static complex gate circuit shown 
in Figure 3(c). Timing assumptions RTA2 or RTA3 lead to 
the simpler static circuits of Figure 3(d) and 3(e), respec­
tively. Note that these two circuits are actually subcircuits 




Figure 3. Static C-Elements: (a) C-Element 
with hazards, (b) locally timed, (c) Speed- 
Independent, (d) with RT assumption a↓ ≺ b↓,
(e) with RT assumption a↑ ≺ b↑
4.2.1. Relative timing verification
The SC circuit in Figure 3(a) is implicitly hazard-free under 
the fundamental mode assumption. Relative timing allows 
this assumption to be made explicitly. If, for instance, the 
environment responds quickly, b↓ may immediately follow 
z↑, before node az rises. This race shows up as a failure 
when verifying the circuit against the specification. Verification
engines can be enhanced to support relative timing 
by generating a set of RT constraints from these verification
failure states. The following two explicit relative timing 
constraints on the burst-mode implementation were generated
by an enhanced version of Analyze¹ [15]:
RTC4: bz↑ ≺ a↓
RTC5: az↑ ≺ b↓
Valid sets of RT constraints are not necessarily unique. The 
following is another set of RT constraints that are less restrictive
because they do not require circuit stability:
RTC6: bz↑ ≺ ab↓
RTC7: az↑ ≺ ab↓
1 Analyze is a bisimulation verifier. Only hazards that affect the outputs 
are reported.
These sets of RT constraints rely on delay paths through 
the environment because az↑, bz↑, a↓, and b↓ are all enabled
from z. One possible implementation that can guarantee
that these constraints hold independent of environment
delays is shown in Figure 3(b), where a buffer is added 
at the output. All constraints can be made local to the circuit
because the AND gates and the buffer are enabled by 
signal c. Constraints RTC4 and RTC5 can be modified to 
bc↑ ≺ z↑ and ac↑ ≺ z↑, which hold if the delay through 
the buffer is larger than through the AND gates.
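Once a constraint has been made local, validating it amounts to comparing two delay paths that start from the same enabling transition. The following sketch expresses that comparison for min/max delay ranges. The `DelayRange` type, the `precedes` helper, and all numbers are hypothetical placeholders; real values would come from circuit simulation such as the SPICE runs mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelayRange:
    """Best-case and worst-case delay of a path, in picoseconds."""
    min_ps: float
    max_ps: float


def precedes(first: DelayRange, second: DelayRange) -> bool:
    """Both paths are enabled by the same transition (signal c in the text).

    The ordering is guaranteed only if the slowest `first` path still
    finishes before the fastest `second` path.
    """
    return first.max_ps < second.min_ps


# Hypothetical numbers, for illustration only.
and_gate_path = DelayRange(min_ps=120.0, max_ps=180.0)
buffer_path = DelayRange(min_ps=200.0, max_ps=260.0)
```

With these placeholder values the check passes, mirroring the condition that the buffer delay exceeds the AND gate delay; overlapping ranges would leave the race unresolved.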
4.2.2. C-Element summary
Table 2 summarizes the five alternative designs. Except 
for the static C-Element (SC), all implementations are 
hazard-free in their respective environments. The speed- 
independent circuit (SIC) is slower than all others. The 
relative timing assumption (SIC-RT), which leads to a half 
size circuit, also enhances performance by 30%. The static 
SC requires the largest circuit but it is also relatively fast. 
The reduced domino C-Element (GC-RT) is 15% faster to 
rise (having only a single pull-up transistor), but is actu­
ally slower than the gC on the falling edge. The speed- 
independent circuits require considerably higher switching 
energy even when applying RT assumptions. The static 
implementation without relative timing shows comparable
power to the simpler GC and GC-RT circuits largely due 
to the short circuit current through the keepers as the GC 
circuits switch. The GC-RT circuit shows higher power 
consumption than the GC circuit because the removal of 
the pMOS device results in an additional short-circuit current
when b↑ follows a↑. The table shows that the static 
and SI circuits are fully testable for exhaustive patterns, 
but not when timing reduces signal interleavings (in col­
umn RTA2). The RT optimized versions of these circuits 
are fully testable.
4.3. Timing evolution in a ring
In this section we trace the development of a simple 
FIFO cell, a simplified abstraction of a part of the RAPPID
design [13], following closely the actual steps we have 
made. We begin with a speed-independent design, and re­
view a succession of progressively simpler circuits, enabled 
through careful application of relative timing assumptions.
4.3.1. Speed-independent FIFO cell
A simple FIFO cell can be specified in CCS as follows.
LEFT = li↑.c.lo↑.li↓.lo↓.LEFT
RIGHT = c.ro↑.ri↑.ro↓.ri↓.RIGHT    (1)




















Circuit  Hazard-free  Rise delay  Fall delay  Energy  Transistors  Test (exhaustive)  Test (RTA2)
SIC      Yes          1170pS      1190pS      20.2pJ  16           100%               90%
SIC-RT   Yes          735pS       785pS       14.0pJ  8            n/a                100%
SC       No           700pS       545pS       11.6pJ  18           100%               92%
GC       Yes          640pS       585pS       11.1pJ  10           100%               100%
GC-RT    Yes          530pS       600pS       11.6pJ  9            n/a                100%
Table 2. Comparison of C-Element implementations. Energy is for a complete cycle (rise and fall). 
Test columns show COSMOS stuck-at fault coverage, with reduced patterns in RTA2 column due to 
environment restrictions.
The specification in Equation (1) consists of two handshake 
processes, LEFT and RIGHT. The c signal synchronizes the 
two processes so that ri must go low and li must rise before
both processes may proceed. This process-based spec-




Figure 4. FIFO specification Petri-net
The circuit definition, shown in Figure 5, can be synthe­
sized from this specification using Petrify [5]. This circuit 
definition uses the complex gate assumptions where the in­
verters are zero-delay or are combined with the complex 
gates. This definition, as well as a physical circuit imple­
mentation that includes discrete inverters, can be proven to 
conform to the specification of the FIFO in Equation (1).
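To make the behavior of the process-based specification concrete, the sketch below enumerates the reachable states of the two handshake processes composed in parallel with the synchronization c hidden. It is an illustrative model only: each process is treated as a cyclic list of actions, both occurrences of c are assumed to be the complementary halves of one CCS synchronization (the overbar that would distinguish them does not appear in the text above), and the function names are hypothetical.

```python
from collections import deque

# Cyclic action sequences taken from Equation (1); "+" rising, "-" falling.
LEFT = ["li+", "c", "lo+", "li-", "lo-"]
RIGHT = ["c", "ro+", "ri+", "ro-", "ri-"]


def successors(state):
    """Return (label, next_state) pairs for state = (left index, right index)."""
    i, j = state
    moves = []
    # Restricted action c: both processes must take it together.
    if LEFT[i] == "c" and RIGHT[j] == "c":
        moves.append(("tau", ((i + 1) % len(LEFT), (j + 1) % len(RIGHT))))
    # Visible actions interleave freely.
    if LEFT[i] != "c":
        moves.append((LEFT[i], ((i + 1) % len(LEFT), j)))
    if RIGHT[j] != "c":
        moves.append((RIGHT[j], (i, (j + 1) % len(RIGHT))))
    return moves


def reachable(start=(0, 0)):
    """Breadth-first search over the composed state space."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for _, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

In this model LEFT cannot produce lo↑ until RIGHT has returned to its initial state, which reflects the statement that ri must go low and li must rise before both processes proceed. A relative timing assumption can be explored in the same framework by pruning the moves it forbids.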
Figure 5. Speed-independent FIFO cell
4.3.2. Burst-mode FIFO cell
The circuit definition of Figure 5 pays a considerable delay 
penalty to achieve speed independence. Note that lo↑ is 
produced after three complex gate delays, and ro in four. 
Perhaps the performance can be improved if the circuit can
ensure that concurrent outputs are generated faster than they 
can be acknowledged by the environment. This assumption 
can be formulated as follows:
RTA8: lo↑ ≺ ri↑
RTA9: ro↑ ≺ li↓
Figure 6. FIFO specification Petri-net with RT 
constraints RTA8 and RTA9 represented as 
dashed arcs
A new specification is generated by adding these two rel­
ative timing assumptions to the specification. The specifi­
cation can be represented as
FIFO ∧ lo↑ ≺ ri↑ ∧ ro↑ ≺ li↓    (2)
where FIFO is the specification from Equation (1). This can 
be represented in the Petri-net of Figure 6 where the dashed 
arrows are relative timing constraints.
Note that the two relative timing constraints in RTA8 and 
RTA9 are in a form where outputs precede inputs. You can 
also note from the specification that the outputs are enabled 
concurrently from a pair of inputs. This is exactly a burst-mode
constraint [3] where the input burst is {li↑, ri↓}
and the output burst is {lo↑, ro↑}. This burst-mode timing,
shown in Figure 7, assumes that the variance in the 
generation of the concurrent outputs is always less than the 
response time of the environment².
Incorporating the RT assumptions RTA8 and RTA9 
directly into Specification (1) produces the Mealy state machine
of Figure 8. This new form is suitable for synthesis:
0 1  li↑ | lo↑ ro↑
1 2  li↓ | lo↓
1 3  ri↑ | ro↓
2 4  ri↑ li* | ro↓
3 5  li↓ ri* | lo↓
4 1  ri↓ li↑ | lo↑ ro↑
5 1  ri↓ li↑ | lo↑ ro↑
² Applying burst-mode constraints on the signals {li↓, ri↑} as well 
results in a C-Element - the micropipelines implementation.
Figure 7. Petri-net with partial burst-mode RT
Figure 8. 3D AFSM specification
The circuit of Figure 9 was synthesized by 3D. The 3D 
specification is not identical to Figure 7 due to the implied 
mutex transitions between li↓ and ri↑. However, the synthesized
circuit does not require mutual exclusion and implements
the Petri-net behavior.
Unbounded delays in the inverters result in critical races 
which can cause the physical implementation to fail to con­
form to the specification. However, this circuit can still 
be a valid implementation for some actual device delays. 
RT verification by Analyze extracts the critical races in the 
physical circuit and creates an ordering that must hold for 




RTC10: y↑ ≺ li↑
RTC11: y ≺ ri
RTC12: li↑ ≺ ro↑
The burst-mode implementation achieves a 2.8× average 
speedup over the SI circuit. Constraints RTC10-RTC12 apply
only to the physical implementation and must be validated
by a timing verifier.
4.3.3. Right before left
Assume that we connect the circuit of Specification (2) into 
a ring with a single token. The token will always arrive at 
an idle cell due to circuit delays if the ring is sufficiently 
large. Hence the handshake in process RIGHT will always 
complete before a new handshake in process LEFT. The SI 
or BM circuits can safely be used in a large ring. How­
ever, if one takes advantage of the timing of the system, an 
improved circuit (in terms of power, performance, area and 
testability) can be derived. RTA13 expresses ordering due 
to timing in a large ring:
RTA13: ri↓ ≺ li↑
This assumption can be graphically represented as 
shown in Figure 10, where the dashed arc is the relative 
timing relation RTA13.
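The phrase "sufficiently large" can be given a rough quantitative reading. The sketch below assumes a deliberately simplified model that is not taken from the RAPPID design: all cells are identical, the token needs a fixed forward latency per cell, and a cell needs a fixed recovery time after passing the token before its RIGHT handshake has completed. Under those assumptions RTA13 holds when the token's round trip exceeds the recovery time. The function name and the numbers are hypothetical.

```python
def min_ring_size(forward_latency_ps: int, recovery_time_ps: int) -> int:
    """Smallest number of cells N such that N * forward_latency > recovery_time.

    Simplified model (an assumption, not the RAPPID analysis): the token
    returns to a cell after visiting all N cells, and the cell is idle
    again once recovery_time_ps has elapsed.
    """
    if forward_latency_ps <= 0:
        raise ValueError("forward latency must be positive")
    return recovery_time_ps // forward_latency_ps + 1


# Placeholder delays in picoseconds, chosen only to exercise the formula.
example_size = min_ring_size(forward_latency_ps=350, recovery_time_ps=1200)
```

In practice the margin would have to account for delay variation across cells, which is one reason the text treats this ordering as an assumption to be validated rather than a property that holds by construction.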
Figure 9. Relative timed burst-mode FIFO
Figure 10. Net representing addition of RT assumption
ri↓ ≺ li↑
The dashed arc is not a causal arc; ri must go low before
li can rise, but ri cannot delay li. This represents
a major change in the operation of the circuit; the 
LEFT process is no longer synchronized directly with the 
RIGHT process except through system timing. The design 
must guarantee that the token appears on the dashed arc before
lo↓.
The circuit in Figure 11 can be synthesized with 3D from 
Specification (2) using assumption RTA13. The rising edge 
of signal li must be delayed sufficiently through lo and 
the buffer to ensure that the domino AND gate is not disabled
before it is fully set. This results in a number of RT 
constraints on critical races in the circuit that can be derived 
as was done for RTC4-RTC7 in the SC circuit. This circuit 
shows 1.7× and 3.6× improvement in worst case performance
over the burst-mode and SI circuits respectively, and 
energy is also improved by 1.8× and 2.1×.
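The improvement factors quoted for the aggressive relative timed circuit can be reproduced from the Table 3 entries. The short calculation below uses the first delay column of Table 3 as the worst-case delay, which is an assumption about the column's meaning; the values themselves are copied from the table.

```python
# Values copied from Table 3; delays in pS, energy in pJ.
# Assumption: the first delay column is the worst-case delay.
worst_case_delay_ps = {"SI": 2160, "RT-BM": 1020, "RT-Agr": 595}
energy_pj = {"SI": 37.6, "RT-BM": 32.2, "RT-Agr": 18.2}


def improvement(table, baseline, improved="RT-Agr"):
    """Ratio of the baseline figure to the improved figure, to one decimal."""
    return round(table[baseline] / table[improved], 1)


delay_vs_burst_mode = improvement(worst_case_delay_ps, "RT-BM")
delay_vs_si = improvement(worst_case_delay_ps, "SI")
energy_vs_burst_mode = improvement(energy_pj, "RT-BM")
energy_vs_si = improvement(energy_pj, "SI")
```

The four ratios agree with the 1.7× and 3.6× performance figures and the 1.8× and 2.1× energy figures stated above.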
4.3.4. Pulse-mode FIFO cell
RTA13 now constrains the specification sufficiently to derive
a pulse-mode circuit. Note that through transitivity,
ro↓ must also precede li↑. We can use this weaker constraint
to discard ri, the backward handshake signal, altogether.
We show how this can be accomplished through 
transformations on the circuit of Figure 11.
Figure 11. Aggressive relative timed FIFO
Figure 12. Aggressive relative timed FIFOs
Figure 13. Shuffled aggressive relative timed FIFO cell
Figure 14. Relative timed pulse-mode FIFO
Designing reliable pulse-mode circuits is very diffi­
cult [10]. We can observe some of the constraints of pulse 
circuits by understanding how we have derived the pulse­
mode circuit in this example. Figure 15 shows a four-phase 
request-acknowledge handshake. Constraints 1 through 4 
are causal with speed-independent signaling. By removing 
the acknowledgment signal (lo and ri in this case), we are 
left with only the request signal that requires constraints 2p 
and 4p. These constraints contain both minimum and max­
imum metric bounds. However, the actual requirements for 
the size of these bounds can be represented with relative 
timing arcs. Interestingly, these arcs correspond to a proto­
col very similar to the standard request acknowledge hand­
shaking.
Three elements of the ring are shown in Figure 12. Observe
that the lo signal is nothing more than a delayed version
of the li signal. Shuffling the lo devices and bubbles 
results in the circuit of Figure 13, which has only forward-moving
signals without any inter-cellular feedback. The 
shuffling that removes acknowledgment is directly based on 
RTA13, which dissociates the LEFT process from the RIGHT. 
This shuffling turns output lo and input ri into local signals.
Note that signal l̅i̅ in Figure 13 is just li inverted. A 
transition li↑ creates a short period when both li and l̅i̅ 
are high, which will set the output of the domino AND gate. 
The duration of both inputs to the domino AND gate being 
high depends on the delay in the l̅i̅ path. This signal pair 
can be combined into a single wire li if the signal on this 
wire operates as a pulse. The final circuit derivation can be 
seen in Figure 14.
The following specification removes the direct handshake
signals lo and ri of Specification (1) and adds 
RTA13:
LEFTP = li↑.c.li↓.LEFTP
RIGHTP = c.ro↑.ro↓.RIGHTP
PULSE = (LEFTP | RIGHTP)\{c} ∧ ro↓ ≺ li↑    (3)
Figure 15. Four cycle and pulse handshake 
protocol constraints
The pulse on li of Figure 14 causes the output pulse 
ro, as required by Specification (3). If we map req to li 
and ack to ro in Figure 15, we see that arc 1 is causal. 
However, this circuit can fail if the pulse is so short that 
the ro (ack) pulse does not occur. We can therefore impose
an RT constraint that requires ro↑ (ack↑) before li↓ 
(req↓). This makes arc 2 in Figure 15 an RT constraint, 
and slightly restricts the specification. (It may be possible 
to not restrict the specification if an internal signal toggles 
which ensures the domino gate has changed state.) The circuit
will also fail if the li (req) pulse is too long. If ro↓ 
(ack↓) and y↑ have occurred before li↓ (req↓) then an 
additional pulse on ro might be generated. Therefore, arc 3 
in Figure 15 is a necessary RT constraint for the circuit to 
work. Finally, arc 4 is assumed to hold from RTA13 which 
drove this example. We therefore have a system of causal 
and relative timing relations that must hold in the pulse­




















SI Yes 2160pS 1560pS 37.6pJ 39 98% 91% n/a
RT-BM No 1020pS 550pS 32.2pJ 40 95% 74% n/a
RT-Agr No 595pS 390pS 18.2pJ 20 n/a 100% n/a
Pulse No 350pS 350pS 16.2pJ 17 n/a n/a 100%
Table 3. Comparison of FIFO implementations. Energy accounts for a complete four-phase cycle. 
Synchronous testing in COSMOS required extra test gate for pulse circuit.
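The two failure modes discussed for the pulse-mode cell (input pulse too short, input pulse too long) bound the width of the li pulse from both sides. The sketch below states that window as a check on three delays measured from li↑. The function and parameter names and the example values are hypothetical; they only illustrate how arcs 2 and 3 of Figure 15 translate into relative orderings of events.

```python
def pulse_width_ok(width_ps: float, ro_rise_ps: float, ro_fall_ps: float) -> bool:
    """Check a pulse on li against the two RT constraints on its width.

    width_ps   : time from li rising to li falling
    ro_rise_ps : time from li rising to ro rising
    ro_fall_ps : time from li rising to ro falling
    All three are hypothetical measurements relative to the li rising edge.
    """
    long_enough = ro_rise_ps < width_ps    # arc 2: ro rises before li falls
    short_enough = width_ps < ro_fall_ps   # arc 3: li falls before ro falls
    return long_enough and short_enough


# Placeholder delays in picoseconds.
nominal = pulse_width_ok(width_ps=300.0, ro_rise_ps=200.0, ro_fall_ps=450.0)
too_short = pulse_width_ok(width_ps=150.0, ro_rise_ps=200.0, ro_fall_ps=450.0)
too_long = pulse_width_ok(width_ps=500.0, ro_rise_ps=200.0, ro_fall_ps=450.0)
```

Although the window looks like a pair of metric bounds, each side is an ordering between two events, which is why it can be stated and verified as relative timing.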
4.3.5. Ring summary
Some consequences of evolving a simple FIFO-like con­
troller from a speed-independent to a pulse-mode circuit are 
summarized in Table 3. The different circuits are character­
ized in terms of robustness, performance, power, area, and 
testability. The latency of the SI circuit is from three to five 
times longer than the circuits that use timing. The circuit is 
not fully testable, and the testability degrades as the circuit 
is placed in an environment where concurrency is restricted. 
The more aggressive timing assumptions tend to increase 
the performance of the circuits, reduce the area and power, 
and generally increase the testability. Note that the most 
significant improvements in performance, area and power 
have all been achieved by the burst-mode and aggressive RT 
transformations. The additional savings awarded by going 
to pulse mode are much less pronounced. Indeed, the ’ag­
gressive’ RT controller may already be considered a pulse 
mode circuit. We feel that testability is increased using rel­
ative timing because many of the redundant coverings are 
removed when the circuits are optimized for time.
4.4. Tag Unit example
The FIFO ring is a simplified example used for illustra­
tion. Typically, such an application would have synchro­
nizations coming from multiple paths. The Tag Unit exam­
ple from RAPPID [13] shows how relative timing can be 
employed to generate extremely high performance pulse­
mode implementations.
Decoding of variable length instructions is inherently a 
serial process, since the length of any instruction directly 
depends on the lengths of all previous instructions since the 
last branch. The performance of decoding variable length 
instructions directly depends on how fast this serial process 
operates [13]. A critical component in RAPPID is the Tag 
Unit, which synchronizes the serial ordering of instructions. 
The tagging control signals interconnect the Tag Units to 
form a 4×16 torus.
Figure 16. SI Tag Unit. Assumes tagin (ti) handshakes are mutex.
Assume that the simplified interfaces of Figure 16 
are all speed-independent interfaces. This requires
request/acknowledge handshakes; a four-phase protocol is
used. The three behaviors in the boxes are specified as 
follows:
PA = r↑.sr↑.sa↑.(sr↓.sa↓ | a↑.r↓).a↓.PA
PB = sr↑.sa↑.(sr↓.sa↓ | r↑.a↑).r↓.a↓.PB
C4 = (go0 | go1 | go2 | go3).sa.C4
The two PA active processes synchronize the four-phase 
handshake after r requests are received, while the two PB 
processes are passive and synchronize before handshaking. 
Therefore, when the irdy and ti requests arrive and the 
bufreq and to cycles have completed, the ti and irdy 
signals will be acknowledged and the to and bufreq 
cycles will start. This is accomplished in the specification 
by renaming the signals and composing the processes as 
follows:
IRDY = PA[irdy/r, irdyack/a, go0/sr]
TAGIN = PA[ti/r, tia/a, go1/sr]
TAGOUT = PB[to/r, toa/a, go3/sr]
BUFREQ = PB[bufreq/r, bufack/a, go2/sr]
TAGUNIT = (IRDY | TAGIN | TAGOUT | BUFREQ | C4)\{go0, go1, go2, go3, sa}    (4)
The implementation of these processes using ATACS is 
shown in Figure 17. Processes PA and PB result in very 
efficient implementations. However, the large OR gates,
C-Elements, and the necessity of passing through three state 
machines from the input to output of the tag path create 
significant latency in this implementation.
Figure 17. Speed-independent Tag Unit circuits:
(a) PA (b) PB (c) C4
The circuit used in RAPPID is shown in Figure 18. This
efficient circuit is very similar to the simplified FIFO derived
in Section 4.3, with the extra gates being used to steer
the tag paths based on the instruction length. The back­
ward handshake signals in the tag path have been removed, 
and the forward-going signals are pulses. The request and 
acknowledge protocols on the irdy and bufreq paths 
are combinations of four-phase and pulse-mode signaling 
- irdyack and bufreq being pulses.
Figure 18. Simplified RAPPID Tag Unit
The specification for the RAPPID tag circuitry is shown 
in Equation 5. The processes are behavioral pulse-based 
specifications without timing. For example, the lowering 
edge of the pulse signal ti↓ and the output pulse to are 
concurrent. The timing assumptions necessary to create the 
circuit can be classified by type according to Figure 15. The 
type 4 assumptions on the ti and to signals are encoded 
into the specification since the TAGIN and TAGOUT processes
have been combined. The synchronization signals 
c1 and c2 in the specification encode the causal transitions
of type 1. RTA14-RTA16 encode the type 2p transitions
- minimum pulse-width constraints on to, bufreq, 
and irdyack. (When multiple signals precede another 
we can include them as a set in one constraint.) Assumptions
RTA17-RTA19 are type 3 constraints, ensuring that 
the input pulse lowers before the output pulse. RTA20 and 
RTA21 are type 4 assumptions which require that the pulses
return to the stable state before the next tagin arrives. As-











of the pulsed handshake signals. (Such constraints could 
have easily been placed in the specification, but have been 
included as RT assumptions because they are guaranteed by 
timing rather than by a causal relation.)
RTA14: {bufreq, irdyack} ≺ to
RTA15: {to, irdyack} ≺ bufreq
RTA16: {to, bufreq} ≺ irdyack
RTA17: ti↓ ≺ to↓
RTA18: ti↓ ≺ bufreq↓
RTA19: ti↓ ≺ irdyack↓
RTA20: {bufreq, bufack↑, irdyack, irdy↓} ≺ ti↑
RTA21: {to, bufreq, bufack↑, ba↓} ≺ irdy↑
RTA22: irdyack↓ ≺ irdy↓
RTA23: bufreq↓ ≺ bufack↓
TAGS = b1.ti↑.c1.(ti↓ | c2.to↑.to↓).TAGS
BUF = c1.c2.bufreq↑.(bufreq↓ | bufack↑.bufack↓).BUF
IRDY = irdy↑.(b2.c2.irdyack↑.(irdyack↓ | irdy↓).IRDY
       + nott.irdy↓.nott.IRDY)
MUTEX = (b1.b2 + nott.nott).MUTEX
(TAGS | BUF | IRDY | MUTEX)\{c1, c2, b1, b2, nott}
       ∧ RTA14..RTA23    (5)
(5)
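A set-valued precedence constraint of the kind used in the RT assumptions above can be checked mechanically against an observed event order. The following Python sketch is illustrative only: the function name `holds`, the event strings (with `+` and `-` standing for rising and falling transitions), and both sample traces are assumptions made for this example. It is not the Analyze tool and does not reproduce the RAPPID specification.

```python
def holds(trace, before, after):
    """Check one relative timing constraint `before -< after` on a trace.

    `trace` lists events in the order they were observed, `before` is a set
    of events, and `after` is the single event they must all precede.
    Simplification: each event occurs at most once (one tag cycle).
    """
    position = {event: i for i, event in enumerate(trace)}
    if after not in position:
        return True  # the constrained event never fired, so nothing is violated
    return all(e in position and position[e] < position[after] for e in before)


# Hypothetical single-cycle traces; "+" is a rising and "-" a falling transition.
good = ["ti+", "to+", "bufreq+", "irdyack+", "ti-", "to-", "bufreq-", "irdyack-"]
bad = ["ti+", "to+", "to-", "bufreq+", "irdyack+", "ti-", "bufreq-", "irdyack-"]

input_first = ({"ti-"}, "to-")                  # type 3 style: ti- before to-
pulse_width = ({"bufreq+", "irdyack+"}, "to-")  # set-valued, type 2p style

print(holds(good, *input_first), holds(good, *pulse_width))  # True True
print(holds(bad, *input_first), holds(bad, *pulse_width))    # False False
```

Writing the predecessors as a set mirrors the remark that several signals preceding the same transition can be folded into one constraint. A real checker would also have to handle repeated cycles, which this sketch deliberately ignores.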
Equation (6) shows the complete set of RT constraints
that must hold on the circuit and system for the simplified
RAPPID implementation to be valid. These constraints were
generated and verified through Analyze [15]. RTC24 and
RTC25 are the type 2 constraints, RTC26-RTC28 are the type 3
constraints (the same as RTA17-RTA19 in the specification),
RTC29-RTC32 are the type 4 constraints, and RTC33-RTC34 are
the type 4p constraints. Note that a single delay path constraint may






Tag                  Cycle      Cycle     Area       RAPPID
Circuit   Latency    Time       Energy    # Trans.   Testability
SI        4.75 ns    9.68 ns    255 pJ    294        n/a
RAPPID    1.27 ns    2.61 ns    63 pJ     85         98.6%
Table 4. Comparison of RAPPID Tag Unit with the SI version. Area is the number of transistors, 























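\[
\begin{array}{ll}
\text{RTC24:} & \mathit{to}{\uparrow} \prec \mathit{taglocal}{\downarrow} \\
\text{RTC25:} & \{\mathit{irdyack}{\uparrow},\ \mathit{to}{\uparrow},\ \mathit{tl}{\downarrow}\} \prec \mathit{rdy}{\downarrow} \\
\text{RTC26:} & \mathit{ti}{\downarrow} \prec \mathit{to}{\downarrow} \\
\text{RTC27:} & \mathit{ti}{\downarrow} \prec \mathit{br}{\downarrow} \\
\text{RTC28:} & \mathit{ti}{\downarrow} \prec \mathit{irdyack}{\downarrow} \\
\text{RTC29:} & \mathit{rdy}{\downarrow} \prec \mathit{taglocal}{\uparrow} \\
\text{RTC30:} & \mathit{rdy}{\downarrow} \prec \mathit{ba}{\uparrow} \\
\text{RTC31:} & \{\mathit{taglocal}{\downarrow},\ \mathit{ti}{\uparrow}\} \prec \mathit{ti}{\uparrow} \\
\text{RTC32:} & \mathit{taglocal}{\downarrow} \prec \mathit{rdy}{\uparrow} \\
\text{RTC33:} & \{\mathit{ba}{\uparrow},\ \mathit{ba}{\downarrow}\} \prec \mathit{irdy}{\uparrow} \\
\text{RTC34:} & \mathit{taglocal}{\downarrow} \prec \mathit{ti}{\uparrow}
\end{array}
\tag{6}
\]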
We feel that attaching many, if not all, of the timing constraints
as RT predicates makes the specification more perspicuous
and explicitly annotates the timing requirements.
Each process represents an interface with a simple
definition, which is refined by timing assumptions as
predicates. Incorporating the assumptions into the specification
itself removes much of the clarity of the required
synchronizations and orderings. Representing the complete behavior
constraints or timing constraints as a Petri net, as was
shown in Section 4.3, can be illuminating for
small examples, but can be confusing and impractical for
larger, real-world examples such as the Tag Unit in RAPPID.
This is particularly the case for pulse-based implementations,
where the set of timing constraints can be quite large.
A comparison of the two implementations is made in
Table 4. The RT circuit shows a 3.5x improvement in area,
4x in power, and 3.7x in latency and throughput over the
speed-independent circuit. Since this circuitry is in the
critical path of the RAPPID length decoder, the improvements
in this example can fairly directly map to improvements in 
RAPPID [13]. While the area of this controller is a fraction 
of RAPPID, the area impact on RAPPID from the RT circuit 
is arguably much higher than the size of the controller. The 
RAPPID architecture can be scaled to reach a higher perfor­
mance. If slow parts are used, higher scaling factors must 
be employed to meet the target performance. If the slower 
SI tag unit had been used in RAPPID, the area would have 
ballooned significantly through scaling if the performance 
goals were to be met. The area savings from the 50%
reduction in wire count are also significant. Since RAPPID
tagging uses point-to-point signaling connected in a torus,
removing the backward acknowledgment path resulted in a 
savings of 14 wires per tag unit. This reduced the network 
bisection of the tag logic by a total of 224 tag wires.
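The improvement factors quoted above follow directly from the entries of Table 4. The short Python sketch below recomputes them; the dictionary keys are names chosen for this example, and the values are copied from the table.

```python
# Values copied from Table 4 (SI tag unit versus RAPPID RT tag unit).
si = {"latency_ns": 4.75, "cycle_ns": 9.68, "energy_pj": 255, "transistors": 294}
rt = {"latency_ns": 1.27, "cycle_ns": 2.61, "energy_pj": 63, "transistors": 85}

# Improvement factor = SI value divided by RT value (smaller is better for all four).
ratios = {key: si[key] / rt[key] for key in si}

for key, value in ratios.items():
    print(f"{key}: {value:.2f}x")
# latency_ns: 3.74x
# cycle_ns: 3.71x
# energy_pj: 4.05x
# transistors: 3.46x
```

The latency and cycle-time ratios both round to 3.7x, the energy ratio to about 4x, and the transistor-count ratio to about 3.5x.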
5. Conclusion
The development of circuits requires correct operation 
in two domains - behavioral and temporal. Our experiments 
indicate that the design, synthesis, and verification of cir­
cuits can be significantly enhanced if both temporal and 
behavioral domains can be merged. Relative timing is a 
means of combining behavioral and temporal information. 
The state space of the untimed circuit is reduced by removing
the relative signal orderings that the timing constraints
make unreachable.
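The pruning effect can be illustrated on a toy case. The Python sketch below is a minimal illustration, not the algorithm of any tool named in this paper: it enumerates the interleavings of three hypothetical concurrent transitions (`a+`, `b+`, `c+`) and keeps only those that respect a single relative timing assumption. The function name `orderings` and the event names are assumptions made for the example.

```python
from itertools import permutations


def orderings(events, constraints):
    """Enumerate total orders of `events` that respect every (a, b) pair
    in `constraints`, where each pair means that a occurs before b."""
    result = []
    for order in permutations(events):
        index = {event: i for i, event in enumerate(order)}
        if all(index[a] < index[b] for a, b in constraints):
            result.append(order)
    return result


events = ["a+", "b+", "c+"]                # three mutually concurrent transitions
untimed = orderings(events, [])            # every interleaving is reachable
timed = orderings(events, [("a+", "b+")])  # RT assumption: a+ before b+

print(len(untimed), len(timed))  # prints "6 3"
```

One assumption halves the number of orderings here; in a real circuit the constraints remove whole regions of the reachable state graph, which is what gives synthesis more freedom.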
Relative timing is a useful way of reasoning about de­
signs. The waveforms in databooks are presented in such 
a way as to highlight the relation between signals and tran­
sitions. One can use relative timing to architect systems, 
as well as synthesize controllers and verify the correctness 
of systems. Synthesis and verification algorithms can be 
designed to directly support this concept where time is rep­
resented as a relationship similar to a behavioral or causal 
relation.
RT can be applied as aggressively or conservatively as 
desired. In a restricted form, relative timing can be used to
discover races caused by inverter delays in speed-independent
implementations and to show that they are not critical.
Burst-mode constraints are an example of a conservative, implicit
application of RT. Relative timing does not preclude
metric or absolute timing. Metric timing must eventually be
applied in the implementation against the RT constraints to 
prove that they hold. Further, many of the RT constraints 
require a certain amount of slack, or setup and hold times, 
in the precedence relations. The robustness and reliability 
of the circuits can depend directly on the amount of slack 
on the RT constraints.
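The relation between metric timing and the slack on an RT constraint can be sketched as follows. The Python fragment is a minimal illustration under stated assumptions: the arrival bounds, the 50 ps margin, and the function name `slack_ps` are all hypothetical and are not taken from the RAPPID implementation.

```python
def slack_ps(latest_before_ps, earliest_after_ps):
    """Slack of the constraint `before -< after`, in picoseconds.

    Positive slack means the latest arrival of `before` still leads the
    earliest arrival of `after`; zero or negative slack means the ordering
    is not guaranteed.
    """
    return earliest_after_ps - latest_before_ps


# Hypothetical arrival bounds (in ps) from a metric timing analysis:
# constraint name -> (latest arrival of `before`, earliest arrival of `after`).
bounds = {
    "ti- -< to-": (400, 550),
    "ti- -< bufreq-": (400, 380),
}
margin_ps = 50  # assumed minimum slack required for robustness

for name, (latest_before, earliest_after) in bounds.items():
    s = slack_ps(latest_before, earliest_after)
    status = "ok" if s >= margin_ps else "violated or too tight"
    print(f"{name}: slack {s} ps, {status}")
# ti- -< to-: slack 150 ps, ok
# ti- -< bufreq-: slack -20 ps, violated or too tight
```

Comparing the worst-case arrival of the earlier event with the best-case arrival of the later one is the conservative choice: the constraint is accepted only if it holds across the whole delay range.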
Relative timing was a large factor in the quality of the
RAPPID results in terms of throughput, power, area, testability,
and latency [13]. The benefit is demonstrated by
applying relative timing to the examples in this paper.
Acknowledgments
We are grateful for the helpful and constructive comments
from the referees. Henrik Hulgaard and Steve Burns
participated in timing verifications. Jordi Cortadella and
Mike Kishinevsky were the first to introduce automatic RT
into the CAD tool Petrify.
References
[1] R. Alur and D. L. Dill. A Theory of Timed Automata.
Theoretical Computer Science, 126(2):183-235, 1994.
[2] S. Chakraborty, K. Y. Yun, and D. L. Dill. Practical timing 
analysis of asynchronous systems using time separation of 
events. In Proc. IEEE Custom Integrated Circuits Confer­
ence, May 1998.
[3] W. S. Coates, A. L. Davis, and K. S. Stevens. Automatic 
Synthesis of Fast Compact Self-Timed Control Circuits. In 
IFIP Working Conference on Design Methodologies, pages 
193-208, April 1993.
[4] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, 
A. Taubin, and A. Yakovlev. Lazy transition systems: ap­
plication to timing optimization of asynchronous circuits. 
In Proc. International Conf. Computer-Aided Design
(ICCAD), pages 324-331, November 1998.
[5] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, 
and A. Yakovlev. Petrify: a tool for manipulating con­
current specifications and synthesis of asynchronous con­
trollers. IEICE Transactions on Information and Systems, 
E80-D(3):315-325, 1997.
[6] H. Hulgaard. Timing Analysis and Verification of Timed
Asynchronous Circuits. PhD thesis, Department of Computer
Science, University of Washington, 1995.
[7] R. Milner. Communication and Concurrency. Computer 
Science. Prentice Hall International, London, 1989.
[8] C. J. Myers. Computer-Aided Synthesis and Verification of
Gate-Level Timed Circuits. PhD thesis, Dept. of Elec. Eng.,
Stanford University, October 1995.
[9] C. J. Myers, T. G. Rokicki, and T. H.-Y. Meng. Automatic
synthesis and verification of gate-level timed circuits. Tech­
nical Report CSL-TR-94-652, Stanford University, January 
1995.
[10] V. Narayanan, B. A. Chappell, and B. M. Fleischer. Static 
Timing Analysis For Self Resetting Circuits. In Interna­
tional Conference on Computer-Aided Design (ICCAD-96). 
IEEE Computer Society, November 1996.
[11] R. Negulescu and A. Peeters. Verification of
speed-dependences in single-rail handshake circuits. In Proc.
International Symposium on Advanced Research in Asyn­
chronous Circuits and Systems, pages 159-170, 1998.
[12] K. J. Nowka and T. Galambos. Circuit Design Techniques
for a Gigahertz Integer Microprocessor. In 1998 IEEE
International Conference on Computer Design: VLSI in
Computers & Processors (ICCD98), pages 11-16. IEEE Computer
Society, October 1998.
[13] S. Rotem, K. Stevens, R. Ginosar, P. Beerel, C. Myers,
K. Yun, R. Kol, C. Dike, M. Roncken, and B. Agapiev.
RAPPID: An Asynchronous Instruction Length Decoder. In 5th
International Symposium on Advanced Research in Asyn­
chronous Circuits and Systems. IEEE, April 1999.
[14] M. Shams, J. C. Ebergen, and M. I. Elmasry. Modeling 
and Comparing CMOS Implementations of the C-Element. 
IEEE Transactions on VLSI Systems, 6(4):563-567, Decem­
ber 1998.
[15] K. S. Stevens. Practical Verification and Synthesis of Low
Latency Asynchronous Systems. PhD thesis, University of
Calgary, Calgary, Alberta, September 1994.
[16] F. C. D. Young, K. S. Stevens, and R. P. Graham Jr. Timed
Logic Conformance and its Application. In 1999 International
Workshop on Timing Issues in the Specification and
Synthesis of Digital Systems (TAU99). ACM/IEEE, March
1999.
[17] K. Y. Yun. Synthesis of Asynchronous Controllers for
Heterogeneous Systems. PhD thesis, Stanford University, Aug.
1994.
