What is the cost of delay insensitivity? by Saito, Hiroshi et al.
What is the cost of Delay Insensitivity? * 
Hiroshi Saito (Univ. of Aizu, Japan)
Alex Kondratyev (Univ. of Aizu, Japan)
Jordi Cortadella (Univ. Politècnica Catalunya, Spain)
Luciano Lavagno (Univ. of Udine, Italy)
Alexander Yakovlev (Univ. of Newcastle upon Tyne, UK)
Abstract
Deep submicron technology calls for new design tech- 
niques, in which wire and gate delays are accounted to 
have equal or nearly equal effect on circuit behaviour. 
Asynchronous speed-independent (SI) circuits, whose be- 
haviour is only robust to gate delay variations, may be 
too optimistic. On the other hand, building circuits totally 
delay-insensitive (DI), for both gates and wires, is imprac- 
tical. The paper presents an approach for automated syn- 
thesis of globally DI and locally SI circuits. It is based 
on order relaxation, a simple graphical transformation of a 
circuit’s behavioural specification, for which Signal Tran- 
sition Graph, an interpreted Petri net, is used. The method 
is successfully tested on a set of benchmarks and a realistic 
design example. It proves effective showing average cost 
of DI interfacing at about 40% for area and 20% for speed. 
1 Introduction 
As the scale of integration increases, managing synchro- 
nization and control of computation and communication on 
deep sub-micron (DSM) integrated circuits using a global 
clock is becoming increasingly difficult. Asynchronous 
systems, free from the clock, offer a number of potential 
advantages, such as reduced risk of synchronization fail- 
ures, low power consumption, improved noise and electro- 
magnetic compatibility to name but a few. 
Interpreted Petri nets (PNs), called Signal Transition Graphs
(STGs) [2, 7], are widely used in specifying asynchronous
system behavior in a formal timing diagram style.
It is known that from an STG one can derive an implemen- 
tation which has the speed-independent (SI) property, i.e., 
such that the behavior of the circuit is correct under any 
distribution of gate delays. The main drawback of SI cir- 
cuits is in neglecting the influence of wire delays on circuit 
behavior. For the DSM technology, where wire and gate 
delays can become (over long wires) equally important, 
the implementation should be targeted at delay-insensitive 
(DI) circuits [19], which allow wire delays to be of arbitrary
value. In fact, a reasonable strategy for future technologies
would require one to partition the system into blocks of
relatively small size, for which the designer can keep control
on wire delays (SI blocks) [10, 18], with a DI interface
between blocks [14].
Logic synthesis of hazard-free asynchronous control 
circuits from STG specifications has reached a good level 
of maturity and automation (comparable in several respects 
to that of synchronous FSMs), as exemplified by the tool
Petrify [3]. Asynchronous CAD is being used both for industrial
and academic design experiments [12, 15]. It is
therefore most natural to introduce DI interfacing into the 
existing STG-based synthesis framework, so far supporting 
the synthesis of both speed-independent circuits and cir- 
cuits optimized using a variety of timing assumptions [4]. 
This approach clearly differs from early ideas about externally
DI and internally timed Macro-modules [16, 11], as well as from
more recent implementation strategies for quasi-DI and DI
circuits [8, 13]. The former relied on specially designed and
potentially slow meta-stability detection circuits. The latter
were based primarily on syntax-directed translation techniques
from process algebraic specifications, rather than on logic
synthesis with inherent optimization under different cost
functions. An alternative
technique, that permits a certain level of delay-insensitivity 
for inter-block communication and relies on local timing 
conditions (Fundamental Mode operation), is based on a 
Burst Mode (BM) Finite State Machine specification [9]. 
The BM approach, however, is not very flexible from the 
point of view of the level of concurrency and distribution 
of control flow, as we will discuss in Section 5.  STG-based 
synthesis, which supports a more powerful Input/Output 
operation mode, allows one to build circuits with a com- 
pletely distributed environment, as opposed to the central- 
ized environment assumed by the FM conditions. 
In this paper we investigate the STG-based approach to 
the design of locally SI and globally DI asynchronous con- 
trol circuits, by posing the problem at the behavioral (STG) 
level. We believe that our method would be particularly
effective in the following two design flow scenarios, both
resulting in fairly large STG specifications that would benefit
from DI interfacing: 
1. circuits specified using a high-level behavioral notation
   (such as CSP or high-level Petri net), subsequently refined
   into a large binary encoded STG,
2. control circuits for regular control structures.

*This work was partially supported by ESPRIT ACID-WG (21949), CICYT TIC 98-0410 and TIC 98-0949, UK EPSRC GRL24038 and G-93775, British Council (Acción Integrada MDR/1998/99/2463).
0-7803-5832-5/99/$10.00 © 1999 IEEE

For both scenarios it is appropriate to partition the large
specification at the STG level and synthesize its blocks
with DI interfaces. In this decomposition, a natural question
arises: what is the cost of DI interfacing?
In order to answer this question, we developed the theory of
iterative transformations of SI specifications towards DI
interfacing (Section 3). The suggested approach was checked
experimentally using a known set of asynchronous benchmarks and
the synthesis tool Petrify [3]. In our experiments (Section 4)
we first partition the given circuit into two parts at the STG
level, and then consider each part separately with the DI
interface in between. We compared the original circuit
(entirely SI) against the new one (SI circuit with DI
interface). The results of this comparison show that the cost
of DI interfacing is on average about 36% for area and 20% for
performance. These figures are quite encouraging because in
the known methods of DI synthesis the area and performance
costs are much higher [6]. Finally, in Section 5 we generalize
the proposed approach to obtain a globally DI implementation
of a totally different specification formalism (BM machines).
We believe that the combination of the SI and DI implementation
styles opens up new perspectives for efficient asynchronous
design for DSM technologies.

Figure 1: Simple asynchronous interface

This work focuses on the automatic introduction of DI
interfaces in the control part of the design. There are
several possible approaches to handling the data part as well.
1. The data-path can be designed using a DI-encoding
   (e.g., dual rail, Sperner codes etc. [20]).
2. If a more efficient data encoding is chosen for the
   data-path, like in Micropipelines [17], the ordering
   conditions between data and a corresponding request signal
   are simpler to satisfy than the ordering conditions between
   several control signals, possibly coming from different
   parts of the design (e.g., two request lines for a data bus
   and an address bus in an inter-module interface as in our
   first example in Figure 1). In particular, routing can be
   constrained so as to keep the skew of a bundle of wires
   under a small upper bound.

Moreover, many designs involve large pieces of control-dominated
logic without any data-path processing. Those include modulo-N
counters, multi-way pulse generators and distributors, arbiters
etc. Their cell-by-cell layout, with DI interfaces between
cells, which can internally be designed as SI or even locally
timed, would make them suitable as firm or hard macros in a DSM
context.

2 Theoretical background
Figure 1.a shows a simple interface between two modules in an
asynchronous system, a master (e.g., a processor) and a slave
(e.g., memory). The interface involves two signal handshakes,
one for controlling the transmission of an address (add_req and
add_ack) and another for data (data_req and data_ack). The
timing diagram shown in Figure 1.a defines the synchronization
protocol between the handshakes for the case of writing data
into the slave. This protocol allows an additional skew
compensation between address and data, making sure that the
address is delivered to the slave strictly before data, thus
permitting an additional delay in the corresponding address
decode logic. This condition is captured by the arc directed
from the rising edge of the add_req signal to that of data_req.
Figure 1.b shows the Petri Net (PN) corresponding to the timing
diagram of the controller. All events in this PN are
interpreted as signal transitions: rising transitions of signal
a are labeled with a+ and falling transitions with a-; we also
use the notation a* if we are not specific about the sign of
the transition. Petri Nets with such an interpretation are
called Signal Transition Graphs (or STGs) [2]. STGs are
typically represented in a "short-hand" form, where places with
one input and one output arc are implicit.
An STG transition is enabled if all its input places contain a
token. In the initial marking {p1, p2} of the STG in Figure 1.c
transition add_req+ is enabled. Every enabled transition can
fire, removing one token from every input place of the
transition and adding one token to every output place. After
the firing of transition add_req+ the net moves to a new
marking, {p3}, where data_req+ becomes enabled.

Figure 2: Consistency violations in STG

Transitions in STG could be involved in different ordering
relations. Transitions a* and b* are in direct conflict if
there exists a reachable marking in which both of them are 
enabled but firing of one of them disables the other. If a* 
and b* are enabled in some reachable marking but are not 
in direct conflict, they are concurrent. Conflict relations 
can be generalized by considering the transitive successors 
of directly conflicting transitions. Transitions which are 
not concurrent and are not in (transitive) conflict are or- 
dered. An STG is consistent if in every transition sequence 
from the initial marking, rising and falling transitions alter- 
nate for each signal. 
There are two sources of consistency violation in an 
STG: 
1. Auto-concurrency, due to concurrency of transitions 
of the same signal (see Figure 2.a,b) and 
2. Switchover incorrectness, due to ordered rising 
(falling) transitions which have no falling (rising) 
transition in between (see Figure 2.c).
The set of all signals in an STG is partitioned into a set
of inputs, which come from the environment, and a set of 
outputs that must be implemented. 
In addition to consistency, the persistency property is 
required for an STG to be implementable as a hazard-free 
asynchronous circuit. An event a* is persistent in marking 
m if it is enabled in m and remains enabled in any other 
marking reachable from m by firing another event b*. An 
STG is output-persistent if all output signal events are per- 
sistent in all reachable markings and input signals cannot 
be disabled by outputs. Output persistency therefore only 
allows input events to be in direct conflict (thus modeling 
non-deterministic choice in the environment). 
The following important statement was proved in [2]: 
an STG can be implemented by a speed-independent circuit 
if it is consistent and output-persistent. 
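The definitions of this section can be illustrated with a small Python sketch that plays the token game and checks consistency and output persistency by exhaustive state exploration. It is only a sketch under stated assumptions: the net is safe (at most one token per place), places are explicit rather than in short-hand form, every transition is named by its signal followed by `+` or `-`, and the names `check_stg`, `handshake` and `choice` are hypothetical and do not come from the paper.

```python
from collections import deque


def signal_of(transition):
    """Strip the trailing sign: 'req+' -> 'req'."""
    return transition.rstrip("+-")


def check_stg(net, m0, values0, inputs):
    """Explore all reachable states of a safe STG.

    net      maps a transition name to (input_places, output_places)
    m0       is the set of initially marked places
    values0  maps each signal to its initial binary value
    inputs   is the set of input signal names
    Returns (consistent, output_persistent).
    """
    start = (frozenset(m0), tuple(sorted(values0.items())))
    seen, queue = {start}, deque([start])
    consistent = persistent = True
    while queue:
        marking, frozen_values = queue.popleft()
        values = dict(frozen_values)
        # A transition is enabled if all its input places contain a token.
        enabled = [t for t, (pre, _) in net.items() if pre <= marking]
        for t in enabled:
            pre, post = net[t]
            sig, new_value = signal_of(t), int(t.endswith("+"))
            if values[sig] == new_value:
                consistent = False  # rising/falling edges do not alternate
                continue
            # Firing removes a token from every input place and adds one
            # to every output place.
            nxt = (marking - pre) | post
            for u in enabled:
                if u != t and not net[u][0] <= nxt:
                    # t disabled u: allowed only between two input events.
                    if not (signal_of(t) in inputs and signal_of(u) in inputs):
                        persistent = False
            state = (nxt, tuple(sorted({**values, sig: new_value}.items())))
            if state not in seen:
                seen.add(state)
                queue.append(state)
    return consistent, persistent


# Hypothetical four-phase handshake: req+ -> ack+ -> req- -> ack- -> req+.
handshake = {
    "req+": ({"p0"}, {"p1"}),
    "ack+": ({"p1"}, {"p2"}),
    "req-": ({"p2"}, {"p3"}),
    "ack-": ({"p3"}, {"p0"}),
}
# Hypothetical choice between x+ and y+ fed by the same place p0.
choice = {
    "x+": ({"p0"}, {"p1"}),
    "x-": ({"p1"}, {"p0"}),
    "y+": ({"p0"}, {"p2"}),
    "y-": ({"p2"}, {"p0"}),
}
```

With these example nets, the handshake passes both checks, while the choice net is output-persistent only when both `x` and `y` are declared as inputs, which matches the remark that only input events may be in direct conflict.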
3 Delay-Insensitive Interfacing 
Our approach has two distinctive features: 
• It is focused not on total delay-insensitivity but on
  delay-insensitive interfacing only. The basic assumption is
  that within a module the designer or a physical design tool
  can keep wire delays under control and hence there is no
  point in ensuring delay-insensitivity at the level of events
  internal to the module.
• Contrary to conventional approaches to DI synthesis, the
  tasks of designing a module and its environment are
  considered separately. This results in asymmetric DI
  interfacing requirements: only inputs are required to be
  accepted in a delay-insensitive fashion by the circuit,
  because delay-insensitivity with respect to outputs matters
  only when the implementation for the environment is
  synthesized.
The above conditions lead to a more relaxed axiomatic 
definition of delay-insensitive interfacing with respect to 
the classical definition of delay insensitivity given in [19].
A specification satisfies the delay-insensitive interfacing 
requirement if it meets the following conditions: 
1. No auto-concurrency.
2. Alternating inputs (input events cannot be ordered
   with other input events).
3. No cross-disabling (inputs and outputs cannot
   disable each other).
Our design framework uses STGs as a model basis. The 
natural question is: what are the implications of the re- 
quirements of DI interfacing for the properties of the orig- 
inal STG? 
Proposition 3.1 A consistent and output persistent STG 
satisfies DI interfacing conditions if and only if no input
transition directly precedes another input transition. 
The proof is trivial: non-auto-concurrency is a nec- 
essary condition of STG consistency, absence of cross- 
disabling is guaranteed by output persistency and alterna- 
tion of inputs directly comes from the definition of DI in- 
terfacing. 
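Proposition 3.1 suggests a purely structural check, which might be sketched in Python as follows. The sketch assumes that the STG is already consistent and output persistent, that it is given in short-hand form as a set of arcs between transitions, and that transition names end in `+` or `-`. The function name `di_violations` and the arc fragment are hypothetical; only the arc from add_req+ to data_req+ is taken from the description of Figure 1.

```python
def signal_of(transition):
    """Strip the trailing sign: 'add_req+' -> 'add_req'."""
    return transition.rstrip("+-")


def di_violations(arcs, inputs):
    """Return the arcs where an input transition directly precedes
    another input transition (the violations named by Proposition 3.1)."""
    return sorted(
        (a, b)
        for a, b in arcs
        if signal_of(a) in inputs and signal_of(b) in inputs
    )


# Assumed fragment of the Figure 1 protocol seen from the slave side,
# where the two request signals are treated as inputs (an assumption).
slave_arcs = {
    ("add_req+", "data_req+"),
    ("add_req+", "add_ack+"),
    ("data_req+", "data_ack+"),
}
slave_inputs = {"add_req", "data_req"}

violations = di_violations(slave_arcs, slave_inputs)
```

In this fragment the only reported arc is the one that orders the address request before the data request, which is exactly the kind of input-to-input dependency that wire delays could reorder.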
Proposition 3.1 gives an idea about the places where DI 
interfacing might be violated in an STG: these are STG 
fragments in which input transitions are directly causally 
related. The addition of arbitrary delays to every input wire 
may unpredictably alter the order of originally ordered in- 
puts to a module. This means that from the module point of 
view such inputs become concurrent. Hence the transfor- 
mation of an STG for DI interfacing removes direct causal 
dependencies between inputs and makes them concurrent. 
This transformation can be performed by iterative applica- 
tion of a simple operation that is called order relaxation 
and is intuitively defined in Figure 3. Note that order re- 
laxation makes previously ordered events a and b to occur 
concurrently “in a burst”. 
The following two properties of order relaxation help 
to clarify the transformation towards DI interfacing. Their 
proofs can be found in [13].

Figure 3: Order relaxation

Property 3.1 Order relaxation between events a and b
preserves pairwise ordering relations between all events 
except for a and b. 
Property 3.2 Order relaxation between two events pre- 
serves output persistency in an STG. 
When in the original STG two inputs are directly 
causally related, then DI interfacing can be obtained only 
by an order relaxation between them. The latter, by Prop- 
erty 3.2, does not cause any new cross-disabling to occur. 
Unfortunately not all the requirements of DI interfacing are 
safely preserved during order relaxation. Indeed if events a 
and b correspond to transitions of the same signal, their order
relaxation immediately produces auto-concurrency. As long as
non-auto-concurrency is preserved, each relaxation step strictly
increases delay-insensitivity, so iterating it eventually yields
a modified specification that meets all the requirements of DI
interfacing.
The algorithm for STG transformation to ensure DI interfacing
is presented in Figure 4. The result of the algorithm is either
a new STG in which the DI interfacing requirements are
satisfied, or a failure when input order relaxation leads to
auto-concurrency. The latter implies that the original STG
cannot be implemented with DI interfacing.¹
Figure 5 illustrates the transformation to DI interfacing 
for the chu133 benchmark example (DI violations are denoted
by shading). DI interfacing is achieved by iterative
application of order relaxation between input events. 
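The iterative use of order relaxation can be sketched in Python on the precedence relation alone. This is a simplified illustration, not the authors' implementation: initial markings are ignored, the failure test only catches the direct case in which both events belong to the same signal, and a step bound guards against non-termination. The names `relax`, `make_di` and the example signals `i1`, `i2`, `o` are hypothetical.

```python
def signal_of(transition):
    """Strip the trailing sign: 'i1+' -> 'i1'."""
    return transition.rstrip("+-")


def relax(arcs, a, b):
    """One order relaxation step between a and b (markings omitted)."""
    preds_a = {p for (p, q) in arcs if q == a}
    succs_b = {s for (p, s) in arcs if p == b}
    new_arcs = set(arcs) - {(a, b)}          # remove causal arc (a, b)
    new_arcs |= {(p, b) for p in preds_a}    # order predecessors of a with b
    new_arcs |= {(a, s) for s in succs_b}    # order successors of b with a
    return new_arcs


def make_di(arcs, inputs, max_steps=1000):
    """Relax input-to-input arcs until none is left.

    Returns the new arc set, or None when two transitions of the same
    input signal would have to become concurrent (auto-concurrency).
    """
    arcs = set(arcs)
    for _ in range(max_steps):
        pending = sorted(
            (a, b)
            for (a, b) in arcs
            if signal_of(a) in inputs and signal_of(b) in inputs
        )
        if not pending:
            return arcs
        a, b = pending[0]
        if signal_of(a) == signal_of(b):
            return None
        arcs = relax(arcs, a, b)
    raise RuntimeError("no fixed point reached within max_steps")


# Hypothetical cycle: i1+ -> i2+ -> o+ -> i1- -> i2- -> o- -> i1+
cycle = {
    ("i1+", "i2+"), ("i2+", "o+"), ("o+", "i1-"),
    ("i1-", "i2-"), ("i2-", "o-"), ("o-", "i1+"),
}
relaxed = make_di(cycle, {"i1", "i2"})
```

In the relaxed arc set both i1+ and i2+ follow o- and precede o+, so the two inputs arrive as a concurrent burst, which is the effect described for Figure 3.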
4 Experimental results 
Two types of experiments, corresponding to the design sce- 
narios outlined in the introduction, have been performed to 
test the proposed method. 
Input:  STG A = (E, F, m0) (E - events, F - precedence relations, m0 - initial marking)
Output: STG Adi = (E, F', m0') with DI interfacing

foreach input events a and b, a -> b do
    /* order relaxation step */
    remove causal arc (a, b)
    /* ordering predecessors of a with b */
    foreach predecessor p of a do
        add arc p -> b;
    /* ordering successors of b with a */
    foreach successor s of b do
        add arc a -> s;
    /* modify m0 if necessary */
    foreach initially marked arc (p, a) do
        m0((p, b)) += m0((p, a));
    foreach b -> s with arc (b, s) initially marked do
        m0((a, s)) += m0((b, s));
    if (a, b) is initially marked then
        foreach arc (p, b) do m0((p, b)) += m0((a, b))
    if STG A becomes auto-concurrent then exit (failure);
endfor

Figure 4: Algorithm for ensuring DI interfacing.

Figure 5: Example of order relaxation

¹ Indeed, Property 3.1 implies that the order in which one chooses the pairwise order relaxation between inputs is irrelevant.

Case study: controller for analog-to-digital converter.
In the first example, we consider the synthesis of a scalable
control circuit, whose STG specification has a regular
structure. It originates from a practical case study of an
asynchronous SI controller for an analog-to-digital converter
(ADC) [5].
This ADC implements a well-known successive ap- 
proximation algorithm. According to this algorithm, a 
comparator is iteratively activated to compare the value of 
the given input voltage with the approximate voltage produced
by a digital-to-analog converter (DAC), whose digital input
comes from a register in which the n-bit value is
refined bitwise, starting from the most significant bit. Each 
refining bit is produced by a one-bit buffer connected to the 
output of the comparator. The use of asynchronous logic 
allows this system to avoid synchronization errors due to 
meta-stability (which is known to be a problem in clocked 
converters due to the analog part of the circuit), and to 
smooth out the temporal effect of potential meta-stability 
resolution [5] over the whole conversion period. 
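For readers unfamiliar with successive approximation, the bitwise refinement described above can be sketched in Python. This models only the arithmetic of the algorithm with an ideal comparator and DAC; it says nothing about the asynchronous controller itself, and the names `sar_convert`, `v_in`, `v_ref` and `n_bits` are hypothetical.

```python
def sar_convert(v_in, v_ref, n_bits):
    """Successive approximation: refine the code bit by bit, MSB first."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)              # tentatively set this bit
        v_dac = v_ref * trial / (1 << n_bits)  # ideal DAC output for trial
        if v_in >= v_dac:                      # comparator decision
            code = trial                       # keep the bit
    return code
```

Each loop iteration corresponds to one comparator activation, so an n-bit conversion needs n sequential steps, which is the sequence that the n-way scheduler discussed next has to control.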
The central part of the asynchronous ADC, which con- 
trols copying a bit value from the one-bit buffer to the n-bit 
register with a single bit shift, is an n-way scheduler; it 
is functionally similar to a classical pulse distributor. The
scheduler's behaviour can be specified by an STG whose 
structure is regular. The specification of a scheduler with 3 
cells is shown in Figure 6.a. 
Figure 6: A specification of 3-cell scheduler (a) and the input order relaxation for the cell 1 (b)

From the analysis of the causal relations between events one
can see that the behavior of the i-th cell of the scheduler
depends on the state of the (i-1)-th and (i+1)-th cells,
together with the signal clamp (produced by completion
detection logic in a storage buffer; see Figure 7.a).
Hence the speed-independent implementation of the sched- 
uler might be obtained directly using the STG of Figure 6, 
which gives the following logic circuit: 
$l_i = \mathit{clamp}\,(\bar{x}_{i-1}\,\bar{x}_{i+1} + l_i) + l_i\,\bar{x}_{i-1}$;
$x_i = \bar{x}_{i-1}\,x_i + l_i + l_{i+1}$;
$b = \bar{l}_1\,\bar{l}_2 \cdots \bar{l}_n$
The drawback of the SI implementation is that the de- 
signer is responsible for satisfying the SI assumptions 
about wiring delays between scheduler cells. 
In case of conversion with a large data path (i.e., with 
many cells in the scheduler) or in order to increase the 
layout flexibility, it could be more convenient to partition 
the whole scheduler circuit into smaller parts. These could 
be placed in different positions on the chip (not necessar- 
ily adjacent) and thus require DI interfacing, while within 
each part the designer could still rely on the SI hypothesis, 
as shown in Figure 7.b. 
Figure 7: A scheduler circuit structure 
In order to evaluate an upper bound for the cost of par- 
titioning the scheduler we consider a partition into blocks 
with one cell each. Each cell communicates with its neigh- 
bors in a DI fashion, and therefore synthesis of such a 
scheduler reduces to the task of DI interfacing between 
cells. The result of order relaxation on the STG is shown 
in Figure 6.b, where for the i-th cell all the transitions of
the inputs coming from the (i-1)-th and (i+1)-th cells are
concurrent. From this STG the following logic equations can be
derived:
l i  = clamp (?&-l?fi+lxib + l i )  + l iFi-1;  
xi  = Ci- l (x i  + x i + ~ l i + l )  + li; 
- _  - -- 
b = llZz . . . l , (b + clamp) + b clamp 
A comparison between the SI and DI implementations shows that
the latter is about 38% larger. We have also analyzed the
performance of the SI and DI implementations, using logic
simulation. We have synthesized both
the scheduler circuit and its environment and simulated the 
resulting logic netlist. The degradation of performance due 
to the increased complexity is about 7%. 
It is worth noting that these numbers are significantly lower
than those usually reported for synthesis results of DI
implementations (see e.g. [6], where a 3 times overhead was
reported for a DI implementation of a stack against its SI
counterpart). The reason for that lies in our more flexible
design strategy, namely speed-independent circuits with DI
interfacing instead of totally DI solutions.
Delay-insensitive decomposition. Another group of ex- 
periments was targeted at DI decomposition of a relatively 
complex circuit into two simpler subcircuits with a DI in- 
terface. The experiment (illustrated in Figure 8) started 
from a well-known asynchronous benchmark set, in which
also the environment was synthesized (thus yielding cir- 
cuits without inputs). The set of signals of each benchmark 
was partitioned into two groups, thus yielding two separate
modules as shown in Figure 8.b. Each module plays
the role of the environment for its counterpart, and the in- 
terface between them is made delay-insensitive by apply- 
ing order relaxation between events which are input for 
each module. Note that this process does not always con- 
verge to a correct implementation because of violations in 
non-auto-concurrency resulting from order relaxation (this 
means that decomposition for DI interfacing could be used 
as a guidance criterion for asynchronous system partition- 
ing). For all cases where DI interfacing could be obtained 
for some wire partition, we compared the DI implementa- 
tion (Figure 8.c) against the SI one (Figure 8.a) in terms of 
area and performance. The results are shown in Table 1. 
On average the area penalty is about 36% and the perfor- 
mance degradation is about 20%. 
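The ratio columns of Table 1 can be reproduced with a few lines of Python. The sketch below uses only rows whose figures are legible in Table 1 (mmu(1), master-r(2), trimos and Total); the function name `penalty` and the dictionary layout are hypothetical.

```python
# (area SI, area DI, performance SI, performance DI) as listed in Table 1
table1 = {
    "mmu(1)": (360, 552, 5909, 7246),
    "master-r(2)": (376, 472, 5993, 7650),
    "trimos": (264, 456, 6764, 7462),
    "Total": (4456, 6072, 98052, 114173),
}


def penalty(si_area, di_area, si_perf, di_perf):
    """Return the DI/SI ratios for area and performance, rounded to 2 digits."""
    return round(di_area / si_area, 2), round(di_perf / si_perf, 2)


ratios = {name: penalty(*row) for name, row in table1.items()}
```

A ratio of 1.36 in the Total row corresponds to the area penalty of about 36% quoted in the text.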
Figure 8: The experimental flow for DI decomposition

                        Area                      Performance
Circuit          SI     DI     ratio       SI       DI       ratio
chu133           …      …      …           …        …        …
chu150           …      …      …           …        …        …
…                184    232    1.26        6268     7000     1.12
mmu(1)           360    552    1.53        5909     7246     1.23
master-r(1)      …      …      …           …        …        …
master-r(2)      376    472    1.26        5993     7650     1.28
wrdatab(2)       …      …      …           …        …        …
…                344    448    1.30        7799     9722     1.25
vbe10b           320    392    1.23        8053     8736     1.08
trimos           264    456    1.73        6764     7462     1.10
Total            4456   6072   1.36        98052    114173   1.16

Table 1: Area and performance penalty of DI interfacing
5 Other applications of DI interfacing 
Up to now the DI interfacing approach has been discussed 
in the context of a system architecture consisting of speed- 
independent modules with DI communication. In that 
case the starting point for DI transformation is a speed- 
independent specification of the system, which is gradually 
refined to satisfy the conditions of DI interfacing. 
However, the DI interfacing approach is certainly not re- 
stricted to that particular architecture. The main idea of the 
approach is that delays of system components (gates and 
wires) are roughly separated into two classes: controlled 
and uncontrolled. Components with controlled delays are 
restricted to be placed in close vicinity in the chip and are 
considered to be in the same logic module. Uncontrolled 
delays are due to communications between different logic 
modules [14]. Hence DI interfacing should work equally
well when logic modules are implemented under more ag- 
gressive timing assumptions than speed-independence. It 
is the responsibility of the designer to ensure that each 
module functions correctly under these timing assump- 
tions, while the overall correctness of the system is ensured 
by the DI interfacing between modules. 
A possible extension of the suggested approach is illus- 
trated below by implementing a system with DI interfacing 
starting from burst-mode (BM) behavioral specifications 
[9, 21].
Unlike the cases discussed in Section 4, it will result in 
order relaxation between output signals. 
A burst-mode machine is an FSM-like specification in 
which each state transition is caused by a burst of concurrently
switching inputs followed by a burst of concurrently switching
output and state signals. Implementation
of a BM specification relies on the so called Fundamental 
Mode hypothesis. This hypothesis states that the reaction 
of the environment is relatively slow, and a new input burst 
can only start when all the switching activity caused by the 
previous burst inside the circuit has stopped. 
A burst-mode specification can be equivalently repre- 
sented by an STG model. Figure 9(b) shows the BM spec- 
ification of a FIFO for a SCSI controller [22], while Figure 
9(c) shows its equivalent STG representation. Note that 
the fundamental mode assumption must be translated in 
the corresponding STG as causal arcs which synchronize 
output bursts (e.g., aout- and rout+) with the next input 
bursts (e.g., rin+ and ain+).²
BM specification does not allow any direct ordering be- 
tween inputs: either inputs occur in a burst (concurrently) 
or they are separated by transitions of output and/or state
signals. This means that each individual BM machine nat- 
urally satisfies the conditions of DI interfacing (see Section 
3). However ensuring DI interfacing between a set of com- 
municating BM machines is more complicated than for SI 
modules, because the DI interfacing conditions (Section 3) 
take into account the behavior of input signals only. 
Outputs can change in any order, and their proper re- 
ception must be ensured by the receiving modules. There- 
fore the notion of DI interfacing for a set of SI modules 
relies on “distributed responsibilities”: each module can 
accept DI inputs, and all modules together cooperate in 
a globally DI fashion. This is reasonable because speed- 
independence makes only local timing assumptions (on 
gate fanout wires). 
This approach will not work for the case of BM ma- 
chines because of the non-locality of the fundamental 
mode assumption. Indeed for the FIFO in Figure 9(c) the 
fundamental mode assumption requires that the transitions 
aout- and rout+ of both outputs precede the transitions
rin+ and ain+ of both inputs. However ain+ is produced
by the (i+l)-th cell of the FIFO, that receives only aout 
as input, while rin+ is produced by the (i-1)-th cell, that 
receives only rout as input. Therefore the fundamental 
mode assumption is a timing requirement which cannot be 
ensured only by the local analysis of pairwise communica- 
tions, but requires global timing analysis. Imposing tim- 
ing assumptions on the speed of independent handshakes 
clearly contradicts the nature of DI communication. Hence 
for the case of outputs which communicate with different 
BM machines the fundamental mode assumption must be 
refined via relaxation of output synchronization. Note that, 
contrary to the input order relaxation in case of SI modules
(which is defined purely by a syntactic transformation of the
STG), the refinement of the fundamental mode assumption requires
additional semantic information about the structure of the
distributed environment of the module (which signals communicate
with which other modules).

² The dummy transition labeled with X is equivalent to four arcs, between each output transition and each input transition.
Figure 9: DI transformation of the BM FIFO specification
For the case of the FIFO in Figure 9(c), the refinement 
results in the relaxation of the synchronization for the out- 
put burst aout- and rout+. Considering the natural sep- 
aration of i-th FIFO cell environment into left and right 
handshakes (rin, rout) and (ain, aout), this results in the 
new STG shown in Figure 9(d). 
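The difference between the fundamental mode synchronization and its relaxation can be sketched in Python as a filter over causal arcs. This is an illustration under assumptions, not the authors' procedure: it assumes that the relaxation keeps only those arcs whose output and input belong to the same neighbouring cell, and whether Figure 9(d) contains exactly these arcs cannot be confirmed from the text. The names `fundamental_mode_arcs`, `relaxed_arcs` and `partner` are hypothetical.

```python
from itertools import product


def signal_of(transition):
    """Strip the trailing sign: 'aout-' -> 'aout'."""
    return transition.rstrip("+-")


def fundamental_mode_arcs(output_burst, input_burst):
    """Every output transition precedes every transition of the next input burst."""
    return set(product(output_burst, input_burst))


def relaxed_arcs(output_burst, input_burst, partner):
    """Keep only arcs between signals exchanged with the same neighbour.

    partner maps each signal to the module at the other end of its wire.
    """
    return {
        (o, i)
        for o, i in product(output_burst, input_burst)
        if partner[signal_of(o)] == partner[signal_of(i)]
    }


# Environment of the i-th FIFO cell as described in the text: ain comes from
# cell i+1, which sees only aout; rin comes from cell i-1, which sees only rout.
partner = {"aout": "cell i+1", "ain": "cell i+1",
           "rout": "cell i-1", "rin": "cell i-1"}
outputs, inputs = ["aout-", "rout+"], ["rin+", "ain+"]

full = fundamental_mode_arcs(outputs, inputs)
local = relaxed_arcs(outputs, inputs, partner)
```

The full set has four arcs, matching the dummy transition of Figure 9(c), while the filtered set retains only the two arcs that each neighbouring cell can enforce locally.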
Synthesizing the STG in Figure 9(d) requires one more 
state signal than the STG in Figure 9(c). The performance 
penalty was evaluated by checking the speed of a 3-cell 
FIFO buffer like the one shown in Figure 10. 
Figure 10: 3-cell FIFO buffer with closed environment 
The resulting data is shown in Table 2 in columns BM 
and DI. Note that the only place where the fundamental 
mode assumption comes into play in the STG in Figure 
9(c) is the output burst (aout-rout+). After its relaxation 
the STG in Figure 9(d) makes no implicit timing assump- 
tions. Hence the FIFO buffer synthesized by this STG is in 
fact implemented as locally SI and globally DI. 
We also exploited locally the same (usually reasonable) 
timing assumption that BM synthesis makes, namely that 
the delays of the circuit of a single FIFO cell are smaller 
than the delays of the handshakes between cells. 
The results of this optimization are shown in Table 2 in
column DIopt. Timing optimization reduces the performance
penalty to only 23%.
Table 2: Area and performance penalties for DI interfacing 
of the BM FIFO 
We also analyzed the cost of DI interfacing for other two 
parts of the SCSI controller (the Bus Interface Unit, BIU, 
and the Initiator Send, IS). The resulting area penalties (in 
terms of literals of the logic implementation) are presented
in Table 3. 
For these specifications the cost of transformation to DI 
interface is rather low, due to the fact that the fundamental
mode is used only in a few cases in the SCSI controller,
namely in 3 bursts out of 11 for IS and in 1 burst out of 9 
for BIU. 
Table 3: Area penalty for SCSI controller 
6 Conclusions 
Design styles which neglect wire delays seem to be overly 
optimistic even with the current technology, and will most
likely become less and less applicable when moving to 
deep sub-micron implementations. The extreme case when 
wire delays are assumed to have arbitrary values leads to 
the well known delay-insensitive approach for circuit de- 
sign. However delay-insensitive circuits are often unus- 
able because of their excessive area and performance over- 
heads. In this paper we suggested an approach which re- 
sults in partial delay-insensitivity of an implementation. 
Under this approach a designer identifies a set of “dangerous”
wires which should be implemented in a delay-insensitive
fashion, while for the rest of a circuit other
(more conventional) design styles might be applied. In 
particular, we used speed-independent implementation for 
the parts of a system in which wire delays could be controlled
by the designer or a routing tool, and then applied
the delay-insensitive hypothesis only to the wires running 
between such speed-independent “islands”. 
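The partitioning idea can be illustrated with a minimal sketch, assuming a hypothetical netlist in which every gate has already been assigned to one speed-independent island. The gate names, island labels and the function `classify_wires` are invented for illustration and are not part of the method's tooling; the sketch only shows which wires would fall under the delay-insensitive hypothesis.

```python
# Hypothetical netlist: each gate is assigned to one speed-independent island.
island_of = {"g1": "A", "g2": "A", "g3": "B", "g4": "B"}

# Hypothetical wires, given as (driver gate, receiver gate) pairs.
wires = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g1")]


def classify_wires(wires, island_of):
    """Split wires into intra-island (SI) and inter-island (DI) lists."""
    si, di = [], []
    for src, dst in wires:
        # A wire stays inside an island only if both ends share the island.
        if island_of[src] == island_of[dst]:
            si.append((src, dst))
        else:
            di.append((src, dst))
    return si, di


si_wires, di_wires = classify_wires(wires, island_of)
print("SI (inside islands):", si_wires)
print("DI (between islands):", di_wires)
```

Here the wires g2 to g3 and g4 to g1 cross island boundaries, so they are the ones that would be treated as delay-insensitive.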
We have developed an automated method which trans- 
forms an originally speed-independent specification into a 
specification with DI interface. Contrary to the common 
belief about the high area and performance penalty of DI 
circuits, our experimental results show that the cost of DI 
interfacing is rather moderate: about 40% for area and 20% 
for speed. This is a direct consequence of a more flexible 
strategy of partitioning a system into its speed-independent 
and delay-insensitive sub-domains. 
Acknowledgments 
We are grateful to Alexander Taubin from Theseus Logic 
Inc. for many useful discussions and critiques.
References 
[1] Kees van Berkel. Handshake Circuits: an Asynchronous Architecture
for VLSI Programming, volume 5 of International Series on
Parallel Computation. Cambridge University Press, 1993.
[2] T.-A. Chu. Synthesis of Self-timed VLSI Circuits from
Graph-theoretic Specifications. PhD thesis, MIT, June 1987.
[3] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno, and
A. Yakovlev. Petrify: a tool for manipulating concurrent
specifications and synthesis of asynchronous controllers. IEICE
Transactions on Information and Systems, E80-D(3):315-325, March 1997.
[4] J. Cortadella, M. Kishinevsky, A. Kondratyev, L. Lavagno,
A. Taubin, and A. Yakovlev. Lazy transition systems: application to
timing optimization of asynchronous circuits. In Proc. International
Conf. Computer-Aided Design (ICCAD), pages 324-331, November 1998.
[5] D. J. Kinniment, B. Gao, A. V. Yakovlev, and F. Xia. Toward
asynchronous A-D conversion. In Proc. International Symposium on
Advanced Research in Asynchronous Circuits and Systems, pages
206-215, 1998.
[6] M. Kishinevsky and Christian Nielsen. Naive design of an
unbounded stack. Presentation at the Dagstuhl Seminar on Self-Timed
Design, 1992.
[7] Luciano Lavagno and Alberto Sangiovanni-Vincentelli. Algorithms
for Synthesis and Testing of Asynchronous Circuits. Kluwer
Academic Publishers, 1993.
[8] Alain J. Martin. Compiling communicating processes into
delay-insensitive VLSI circuits. Distributed Computing, 1(4):226-234,
1986.
[9] Steven M. Nowick and David L. Dill. Automatic synthesis of
locally-clocked asynchronous state machines. In Proc. International
Conf. Computer-Aided Design (ICCAD), pages 318-321.
IEEE Computer Society Press, November 1991.
[10] R.H.J.M. Otten and R.K. Brayton. Planning for performance. In
Proceedings of the Design Automation Conference, June 1998.
[11] Fred U. Rosenberger, Charles E. Molnar, Thomas J. Chaney, and
Ting-Pien Fang. Q-modules: Internally clocked delay-insensitive
modules. IEEE Transactions on Computers, C-37(9):1005-1018,
September 1988.
[12] S.-H. Chung and S.B. Furber. The design of the control circuits
for an asynchronous instruction prefetch unit using STGs. In
Proc. Second Int. Workshop on Hardware Design and Petri Nets
(HWPN'99), Williamsburg, VA, pages 131-148, June 1999.
[13] H. Saito, A. Kondratyev, J. Cortadella, L. Lavagno, and A. Yakovlev.
What is the cost of delay insensitivity? Technical Report 99-2-004,
The University of Aizu, August 1999.
[14] C. L. Seitz. Chapter 7. In C. Mead and L. Conway, editors,
Introduction to VLSI Systems. Addison Wesley, 1981.
[15] Ken Stevens, Shai Rotem, Steven M. Burns, Jordi Cortadella, Ran
Ginosar, Michael Kishinevsky, and Marly Roncken. CAD directions
for high performance asynchronous circuits. In Proceedings of the
Design Automation Conference, pages 116-121, June 1999.
[16] Mishell J. Stucki, Severo M. Ornstein, and Wesley A. Clark.
Logical design of macromodules. In AFIPS Conference Proceedings:
1967 Spring Joint Computer Conference, volume 30, pages 351-364,
Atlantic City, NJ, 1967. Academic Press.
[17] Ivan E. Sutherland. Micropipelines. Communications of the ACM,
32(6):720-738, June 1989.
[18] D. Sylvester and K. Keutzer. Getting to the bottom of deep
submicron. In Proceedings of the International Conference on
Computer-Aided Design, November 1998.
[19] J. T. Udding. A formal model for defining and classifying
delay-insensitive circuits and systems. Distributed Computing,
1:197-204, 1986.
[20] Tom Verhoeff. Delay-insensitive codes - an overview. Distributed
Computing, 3(1):1-8, 1988.
[21] Kenneth Y. Yun and David L. Dill. Automatic synthesis of 3D
asynchronous state machines. In Proc. International Conf.
Computer-Aided Design (ICCAD), pages 576-580. IEEE Computer Society
Press, November 1992.
[22] Kenneth Y. Yun and David L. Dill. A high-performance
asynchronous SCSI controller. In Proc. International Conf. Computer
Design (ICCD), pages 44-49. IEEE Computer Society Press, 1995.
