Two trends are of major concern for digital circuit designers: the relative increase of interconnect delays with respect to gate delays and the demand for design reuse. Both pose difficult problems to synchronous design styles, and can be tackled more naturally within the asynchronous paradigm. Unfortunately, even in asynchronous design the usual hypotheses about the delays of gates and wires are often overly optimistic. One popular assumption is to consider gate delays to be arbitrary while neglecting the skew in wire delays (the so-called speed-independence (SI) assumption). Taking wire delays into account is possible and, in the extreme, leads to delay-insensitive (DI) implementations, which work correctly under any wire delay distribution. However, such implementations are costly.
Introduction
Asynchronous systems, free from the clock, offer a number of potential advantages in deep sub-micron digital and mixed-signal design. They include robustness of designs to technology variations, greater modularity and capability for component reuse. These factors are essential in complex applications where complete redesign for a localized functionality change becomes unrealistic, and where time-to-market is crucial.
Two subclasses of asynchronous circuits are known to be able to sustain certain parameter variations: speed-independent (SI) circuits [7] and delay-insensitive (DI) circuits [11]. The behaviour of SI circuits is insensitive to gate delays (these can have arbitrary values), but they assume that wire delays satisfy the following condition (isochronicity of forks): the maximum delay of a wire after a fork must be less than the minimum gate delay. DI circuits allow wire delays to have arbitrary values. Although DI circuits are clearly much more attractive for deep sub-micron technology, where wire delays are as significant as gate delays, the domain of functionally useful DI circuits is very limited if one considers them at the level of ordinary logic gates. Thus DI circuits are typically constructed out of macro-modules that consist of several gates [10].
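The isochronic fork condition stated above can be written as a one-line check. The following is a minimal sketch only; the function name and the delay values are hypothetical and serve purely as illustration.

```python
def fork_is_isochronic(branch_wire_delays, gate_delays):
    """True if the slowest wire branch after a fork is faster than the fastest gate."""
    return max(branch_wire_delays) < min(gate_delays)


# Hypothetical delays in nanoseconds (not taken from the paper).
print(fork_is_isochronic([0.02, 0.05], [0.10, 0.30]))  # short wires: condition holds
print(fork_is_isochronic([0.02, 0.40], [0.10, 0.30]))  # one long wire: condition fails
```

When the second case arises, the SI assumption can no longer be trusted for that fork, which is the situation DI interfacing is meant to cover.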
It is therefore quite natural to look for a way of exploiting the advantages of both design strategies, namely the optimality of SI logic synthesis and the design robustness and DI compositionality of the macro-module method. The target of the synthesis process is therefore a globally DI, locally SI circuit. This approach was suggested, without any concrete implementation strategy, in [12].
Our work has some similarities (and in particular a consistent view of technology trends) with the wire planning strategy suggested in [8]. In both cases, logic synthesis is preceded by a "delay-aware" step that partitions the system into blocks where wire delays are smaller than gate delays. However, due to the synchronous nature of the underlying implementation, [8] requires a placement and global routing step before synthesis. In our case, on the other hand, only the communication protocol between the blocks must be (automatically) modified to satisfy a set of DI axioms. After that, logic synthesis can proceed independently for all the blocks, without requiring any synthesis/layout iteration or interaction. This dramatically simplifies the timing convergence problem for asynchronous ASICs.
The modeling formalism for the suggested design flow is based on Signal Transition Graphs (STGs). It is known that a speed-independent implementation can be derived from an STG using different design procedures [1, 4]. In this paper we suggest a behavioral transformation called order relaxation, which aims to achieve delay-insensitivity with respect to certain STG events. Based on this transformation, an initial specification can be iteratively refined until the desired level of delay-insensitivity is reached.
The rest of the paper is organized as follows. Section 2 contains the theoretical background. The theory behind the DI transformation is presented in Section 3. Section 4 shows an application of the suggested methodology to a realistic design example. Section 5 concludes the work.
Theoretical background
Figure 1.a shows a simple interface between two modules in an asynchronous system, a master (e.g., a processor) and a slave (e.g., memory). The interface involves two signal handshakes, one for controlling the transmission of an address (add_req and add_ack) and another for data (data_req and data_ack). The timing diagram shown in Figure 1.a defines the synchronization protocol between the handshakes for the case of writing data into the slave. This protocol allows additional skew compensation between address and data: it makes sure that the address is delivered to the slave strictly before the data, which leaves additional time for the corresponding address decode logic. This condition is captured by an arc directed from the rising edge of the add_req signal to that of data_req. Figure 1.b shows the Petri Net (PN) corresponding to the timing diagram of the controller. All events in this PN are interpreted as signal transitions: rising transitions of signal a are labeled with "a+" and falling transitions with "a-". We also use the notation a* if we are not specific about the sign of the transition. Petri Nets with such an interpretation are called Signal Transition Graphs (or STGs) [1]. STGs are typically represented in a "shorthand" form, where places with one input arc and one output arc are implicit.
An STG transition is enabled if all its input places contain a token. In the initial marking {p1, p2} of the STG in Figure 1.c, transition add_req+ is enabled. Every enabled transition can fire, removing one token from every input place of the transition and adding one token to every output place.
There are two sources of consistency violation in an STG:
1) Auto-concurrency, i.e. concurrency of transitions of the same signal (see Figure 2.a,b), and
2) Switch-over incorrectness, which takes place between two ordered rising (falling) transitions which have no falling (rising) transition in between (see Figure 2.c).
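The enabling and firing rule, together with a consistency check, can be sketched in a few lines. This is an illustrative sketch under several assumptions, not the procedure used by the authors: the STG is encoded as a dictionary from transition labels to (input places, output places), every label is a signal name followed by "+" or "-", the net is bounded, and the two small nets are hypothetical examples rather than the STG of Figure 1. The check explores all reachable states and requires every enabled rising transition to find its signal at 0 and every enabled falling transition to find it at 1, which flags both kinds of violation listed above.

```python
def enabled(marking, pre):
    """A transition is enabled if every input place holds a token."""
    return all(marking.get(p, 0) > 0 for p in pre)


def fire(marking, pre, post):
    """Remove one token from each input place, add one to each output place."""
    m = dict(marking)
    for p in pre:
        m[p] -= 1
    for p in post:
        m[p] = m.get(p, 0) + 1
    return {p: n for p, n in m.items() if n > 0}  # drop empty places


def is_consistent(stg, marking, values):
    """Explore reachable states of a bounded STG and check signal alternation."""
    start = (frozenset(marking.items()), frozenset(values.items()))
    seen, stack = {start}, [start]
    while stack:
        m_items, v_items = stack.pop()
        m, v = dict(m_items), dict(v_items)
        for t, (pre, post) in stg.items():
            if not enabled(m, pre):
                continue
            signal, sign = t[:-1], t[-1]
            if v[signal] != (0 if sign == "+" else 1):
                return False  # s+ enabled while s=1, or s- enabled while s=0
            v2 = dict(v)
            v2[signal] = 1 - v[signal]
            nxt = (frozenset(fire(m, pre, post).items()), frozenset(v2.items()))
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return True


# Hypothetical four-phase handshake: req+ -> ack+ -> req- -> ack- -> req+ ...
HANDSHAKE = {
    "req+": ({"p0"}, {"p1"}),
    "ack+": ({"p1"}, {"p2"}),
    "req-": ({"p2"}, {"p3"}),
    "ack-": ({"p3"}, {"p0"}),
}
# Hypothetical broken variant: req never falls, so req+ repeats (switch-over error).
BROKEN = {
    "req+": ({"p0"}, {"p1"}),
    "ack+": ({"p1"}, {"p2"}),
    "ack-": ({"p2"}, {"p0"}),
}

print(is_consistent(HANDSHAKE, {"p0": 1}, {"req": 0, "ack": 0}))
print(is_consistent(BROKEN, {"p0": 1}, {"req": 0, "ack": 0}))
```

The explicit state exploration is chosen for clarity; it is exponential in the worst case and is not meant to reflect how synthesis tools perform this check.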
The set of all signals of an STG is partitioned into a set of inputs, which come from the environment, and a set of outputs and state signals that must be implemented.
In addition to consistency, the persistency property is required for an STG to be implementable as a hazard-free asynchronous circuit.
An event a* is persistent in marking m if it is enabled in m and remains enabled in any other marking reachable from m by firing another event b*. An STG is output-persistent if all output signal events are persistent in all reachable markings and input signals cannot be disabled by outputs. Output persistency therefore only allows input events to be in direct conflict.
The following important statement was proved in [1]: an STG can be implemented by a speed-independent circuit if it is consistent and output-persistent.
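Output persistency as defined above can also be checked by exploring the reachable markings. The sketch below is a hedged illustration only: the dictionary encoding of the STG, the label convention (signal name followed by "+" or "-"), the assumption of a bounded net, and the tiny choice net are all assumptions made for this example.

```python
def enabled(marking, pre):
    """A transition is enabled if every input place holds a token."""
    return all(marking.get(p, 0) > 0 for p in pre)


def fire(marking, pre, post):
    """Remove one token from each input place, add one to each output place."""
    m = dict(marking)
    for p in pre:
        m[p] -= 1
    for p in post:
        m[p] = m.get(p, 0) + 1
    return {p: n for p, n in m.items() if n > 0}


def is_output_persistent(stg, marking, inputs):
    """Only two input events may disable each other; any other disabling fails."""
    start = frozenset(marking.items())
    seen, stack = {start}, [start]
    while stack:
        m = dict(stack.pop())
        en = [t for t, (pre, _) in stg.items() if enabled(m, pre)]
        for u in en:
            m2 = fire(m, *stg[u])
            for t in en:
                disabled = t != u and not enabled(m2, stg[t][0])
                if disabled and not (t[:-1] in inputs and u[:-1] in inputs):
                    return False
            key = frozenset(m2.items())
            if key not in seen:
                seen.add(key)
                stack.append(key)
    return True


# Hypothetical choice: a+ and b+ compete for the single token in p0.
CHOICE = {"a+": ({"p0"}, {"p1"}), "b+": ({"p0"}, {"p2"})}

print(is_output_persistent(CHOICE, {"p0": 1}, inputs={"a", "b"}))  # input choice
print(is_output_persistent(CHOICE, {"p0": 1}, inputs={"a"}))       # b is an output
```

With both signals declared as inputs the conflict is allowed; as soon as one of them is an output, the same net is rejected.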
No cross-disabling (inputs and outputs cannot disable each other)
In this work we will relax the above axioms taking into account specific features of the targeted task:
The investigation is focused not on total delay-insensitivity but on delay-insensitive interfacing only (the basic assumption is that within a module a designer or physical design tool can keep wire delays under control, and hence there is no point in ensuring delay-insensitivity at the level of events internal to a module).
Contrary to conventional approaches to DI synthesis, the tasks of designing a module and its environment are considered separately. This results in an asymmetry in the requirements imposed on the DI interface: only inputs are required to be accepted in a delay-insensitive fashion, because delay-insensitivity with respect to outputs matters only when an implementation for the environment is synthesized. Of course, symmetry is re-established if all modules are synthesized in this fashion.
Alternating inputs (input events can only immediately precede output events)
No cross-disabling
Informally, the above conditions are illustrated by Figure 3, where the suggested design approach is targeted at an interface scheme that should be robust to wire delay variations.
Our design framework uses STGs as a model basis. The natural question is: what are the implications of the requirements of delay-insensitive interfacing for the properties of the original STG?
Proposition 3.1 A consistent and speed-independent STG satisfies the DI interfacing conditions if no input transition directly precedes another input transition.
The proof is trivial: non-auto-concurrency is a necessary condition of STG consistency, absence of cross-disabling is guaranteed by speed-independence, and alternation of inputs follows directly from the conditions of the proposition.
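The structural condition of Proposition 3.1 is easy to test mechanically: an input transition directly precedes another input transition when one of its output places is an input place of the other. The following sketch assumes a dictionary encoding of the STG (transition label to input and output place sets) and uses a hypothetical fragment inspired by the write protocol, not the exact STG of Figure 1.

```python
def input_to_input_arcs(stg, inputs):
    """List pairs (t, u) where input transition t directly precedes input transition u."""
    pairs = []
    for t, (_, post_t) in stg.items():
        for u, (pre_u, _) in stg.items():
            both_inputs = t[:-1] in inputs and u[:-1] in inputs
            if both_inputs and post_t & pre_u:  # shared place = direct causality
                pairs.append((t, u))
    return sorted(pairs)


# Hypothetical fragment: add_req+ directly causes both add_ack+ and data_req+.
WRITE_FRAGMENT = {
    "add_req+": ({"p1"}, {"p2", "p3"}),
    "add_ack+": ({"p2"}, {"p4"}),
    "data_req+": ({"p3"}, {"p5"}),
    "data_ack+": ({"p4", "p5"}, {"p6"}),
}

print(input_to_input_arcs(WRITE_FRAGMENT, inputs={"add_req", "data_req"}))
```

An empty result means the condition of the proposition holds; each reported pair is a place where DI interfacing may be violated.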
Proposition 3.1 gives an idea of where DI interfacing may be violated in an STG: these are STG fragments in which input transitions are directly causally related. After adding arbitrary delays to every input wire (see Figure 3), a given module may receive originally sequenced inputs in any order. This means that, from the module's point of view, such inputs are concurrent. Hence a possible transformation strategy for an STG towards DI interfacing is to remove direct causal dependencies between inputs, making them concurrent. This transformation can be performed by iterative application of a simple operation, which we call order relaxation and illustrate in Figure 4. Informally, order relaxation removes a causal arc between events a and b, making them concurrent while preserving other ordering relations as much as possible.
The following two properties of order relaxation help to better understand the transformation towards DI interfacing.
Figure 4). Suppose that the statement of the property is wrong. Then one could find a pair of instances c_i and d_j such that c_i precedes d_j in a concurrent run
When two inputs are directly causally related in the original STG, then according to Definition 3.1 DI interfacing can only be obtained by applying order relaxation to them. By Property 3.2, this does not cause any new cross-disablings. Unfortunately, not all the requirements of DI interfacing are safely preserved during order relaxation. Indeed, if events a and b correspond to transitions of the same signal, their order relaxation immediately produces auto-concurrency. If this does not happen, the transformation strictly increases delay-insensitivity, and by iterative application of it all the requirements of DI interfacing should eventually be met in the specification (provided non-auto-concurrency is preserved).
For this specification, DI interfacing is violated by a direct causal dependency between the input transitions add_req+ and data_req+ (this violation is denoted in Figure 5.a by shading). The violation can be removed by performing order relaxation between the input events.
The order relaxation between add_req+ and data_req+ results in the removal of the arc (add_req+, data_req+), adding direct predecessors of add_req+ to data_req+ (i.e. data_ack-
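Order relaxation can be sketched on a plain set of causal arcs between events. This is only an approximation of the operation described in the text: the text confirms that the arc (a, b) is removed and that the direct predecessors of a become predecessors of b; the symmetric step, making the direct successors of b successors of a, is an assumption made here to preserve the remaining orderings. The arc set below is hypothetical and is not the STG of Figure 5.

```python
def relax_order(arcs, a, b):
    """Remove the causal arc (a, b) so that a and b become concurrent."""
    arcs = set(arcs)
    if (a, b) not in arcs:
        raise ValueError(f"no direct arc from {a} to {b}")
    preds_a = {x for (x, y) in arcs if y == a}
    succs_b = {y for (x, y) in arcs if x == b}
    arcs.remove((a, b))
    arcs |= {(p, b) for p in preds_a}  # predecessors of a now also precede b
    arcs |= {(a, s) for s in succs_b}  # ASSUMPTION: a still precedes successors of b
    return arcs


# Hypothetical ordering: data_ack- -> add_req+ -> data_req+ -> data_ack+
ARCS = {
    ("data_ack-", "add_req+"),
    ("add_req+", "data_req+"),
    ("data_req+", "data_ack+"),
}

for arc in sorted(relax_order(ARCS, "add_req+", "data_req+")):
    print(arc)
```

After the call, add_req+ and data_req+ are no longer ordered with respect to each other, while both still follow data_ack- and both still precede data_ack+.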
Controller for analog-to-digital converter
In this section we present an experiment that has been carried out to test the proposed method and evaluate the cost of DI interfacing in a more practical design example than those considered above.
The example originates from a practical case study in which an asynchronous analog-to-digital converter (ADC) has been developed with a speed-independent controller [3] .
This ADC implements the well-known successive approximation algorithm. According to this algorithm, a comparator is iteratively activated to compare the value of the given input voltage with the approximate voltage produced by a digital-to-analog converter (DAC), whose digital input comes from a register in which the n-bit value is refined bitwise, starting from the most significant bit. Each refining bit is produced by a one-bit buffer connected to the output of the comparator. The use of asynchronous logic allows this system to avoid synchronization errors due to metastability (which is known to be a problem in clocked converters) that may arise in the analog part of the circuit, and to smooth out the temporal effect of potential metastability resolution [3] over the whole period of conversion.
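The successive approximation algorithm itself can be illustrated by a short software model. This is a purely functional sketch under simplifying assumptions (an ideal DAC, an ideal comparator, and hypothetical parameter names); it says nothing about the asynchronous controller or its timing.

```python
def sar_convert(v_in, v_ref, n_bits):
    """Refine an n-bit code bit by bit, starting from the most significant bit."""
    code = 0
    for bit in reversed(range(n_bits)):
        trial = code | (1 << bit)                 # tentatively set the next bit
        v_dac = v_ref * trial / (1 << n_bits)     # ideal DAC output for the trial code
        if v_in >= v_dac:                         # comparator decision
            code = trial                          # keep the bit
    return code


print(format(sar_convert(0.7, 1.0, 3), "03b"))
```

Each loop iteration corresponds to one comparator activation, so an n-bit conversion takes n comparisons.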
The central part of the asynchronous ADC, which controls copying a bit value from the one-bit buffer to the n-bit register with a single bit shift, is an n-way scheduler; it is functionally similar to a classical pulse distributor. The scheduler's behaviour can be specified by an STG whose structure is regular. The specification of a scheduler with 3 cells is shown in Figure 6.a. The drawback of the SI implementation is that the designer is responsible for satisfying the SI assumptions about wiring delays between scheduler cells.
In case of conversion with a data path (with many cells in the scheduler), or in order to increase the flexibility of layout, it could be more convenient to partition the whole circuit of the scheduler into smaller parts which could be placed in different positions on the chip (not necessarily adjacent). Within each part the designer could still rely on the SI hypothesis about the wiring between cells, but in the interface between these parts the wire delays could be large and a more conservative approach is needed. In the extreme, interface delays are assumed to be arbitrary, which leads to DI interfacing and gives the scheduler structure shown in Figure 7.b.
We have also analyzed the performance of the SI and DI implementations using logic simulation. We synthesized both the scheduler circuit and its environment and simulated the resulting autonomous system. Table 1 shows that the performance degradation caused by the increased complexity of the DI implementation is about 7%.
It is worth noting that these numbers are significantly lower than those usually reported for synthesis results of DI implementations. The reason lies in our more flexible design strategy, that is, speed-independent circuits with DI interfacing instead of totally DI solutions.
Conclusions
Design styles which neglect wire delays seem to be overly optimistic even with current technology and will most likely become less and less applicable when moving to deep sub-micron implementations. The extreme case, in which wire delays are assumed to have arbitrary values, leads to the well-known delay-insensitive approach to circuit design. However, delay-insensitive circuits are often unusable because of their excessive area and performance overheads. In this paper we suggested an approach which results in partial delay-insensitivity of an implementation. Under this approach the designer or floor-planning tool identifies a set of long wires which should be implemented in a delay-insensitive fashion, while for the rest of the circuit other (more conventional) design styles might be applied. In particular, we used speed-independent implementation for the parts of a system in which wire delays could be controlled by the designer or a routing tool, and then applied the delay-insensitive hypothesis only to the wires running between such speed-independent "islands" [8]. These wires can then be routed over any distance without affecting the functionality of the circuit (only, of course, its performance), thus dramatically speeding up timing convergence for asynchronous ASICs.
We have developed an automatic method which ensures the DI requirements by using behavior transformations. To the best of our knowledge, this is the first method which produces a delay-insensitive implementation from a formal specification by using a highly optimizing synthesis-based procedure. We believe that this could give a significant reduction in area and performance penalties in comparison to the conventional DI methods, which are based on direct translation of the initial description into the circuit by using pre-defined library modules, followed at most by a conservative peephole optimization.
It is possible to extend this approach to a more aggressive optimization strategy (for both area and speed) than the SI one, to partially compensate for the costs of DI interfacing. It is based on the use of relative timing at the module level, which can be gradually introduced into the SI logic [2, 9].
