Abstract
Introduction
Recent years have seen a revival of interest in the sub-class of asynchronous circuits called speed-independent circuits. Such circuits have been characterized by Muller in his seminal paper [9] as being hazard-free using the unbounded gate delay model. Even though neglecting wire delays can be restrictive in practice, speed-independent circuits are a good starting point for synthesis and optimization procedures that use more detailed and realistic delay models. Moreover, very efficient analysis and synthesis techniques, supported by CAD tools, exist for speed-independent circuits today.
Current synthesis techniques still suffer from a severe limitation: either they assume that the implementation library contains AND gates with unbounded fanin and "free" input inversions ([1, 5, 8]), or they use non-standard "hazard absorbing" flip-flops whose effectiveness in practice still needs to be evaluated ([11]). Other results on the implementability of semimodular circuits without inputs using two-input/two-output AND and OR gates ([15]) are only interesting from a theoretical standpoint, due to their extremely high implementation cost.
* This work has been partly supported by the Ministry of Education of Spain (CICYT TIC 95-0419).
† This work has been partly supported by MURST research project "VLSI architectures".
‡ This work has been partly supported by the U.K. EPSRC GR/J52327 and the British Council Programme (Spain) Acciones Integradas MDR/1996/97/1159.
Only recently have people begun to analyze the decomposability of speed-independent circuits using a given, realistic standard cell-like library. The approach described in [13] works only under the fundamental mode assumption¹, which is overly restrictive and does not fit well theoretically with the unbounded delay assumption. The same authors describe in [12] a method to perform technology mapping for speed-independent circuits that only decomposes existing gates (e.g., a 3-input AND into two 2-input ANDs), without any further search of the implementation space. They do not explore complex decompositions that could use multi-cube divisors or decompose several gates simultaneously. The same limitations also affect the work of [1, 2]. The most recent examples of relevant work can be found in [10, 11, 4]; each of them lacks flexibility with respect to the gate library, the scope of optimization or the extent of logic sharing.
The main contribution of this paper is an efficient solution of the technology mapping problem for speed-independent circuits. We have developed a body of theory that allows us to prune the search space when looking for solutions. We use classical logic synthesis techniques for combinational multi-level logic in order to find good candidate functions for the decomposition. We then derive efficient filtering conditions that guarantee:
• speed-independent implementability of the new signals, and
• a bound on the global increase in complexity of the circuit, due to the need to acknowledge the new signals.
¹ I.e., the environment is not allowed to change circuit inputs unless the circuit is stable.
Theoretical background
In this section we introduce theoretical concepts required for understanding our decomposition method. These concepts are subdivided into three parts: (1) circuit specification and its basic logic implementability; (2) conditions of hazard-free decomposition of complex gates; and (3) correctness-preserving transformations to ensure those conditions. They are summarized in the following subsections.
State Graphs and Logic
A 
Hazard-free implementability
The decomposition of a complex gate into smaller gates creates new signals that are not part of the original specification. In order to guarantee that these new signals do not produce hazards, we must ensure that their covers satisfy the important property of monotonicity, which is defined in this section.
Necessary and sufficient conditions for output-persistent implementation using unbounded fanin AND gates (with unlimited input inversions), bounded fanin OR gates and C elements were given in [8] (extending a previous result of [1]). In this work we consider a similar basic implementation architecture, called the standard-C architecture, which is described in Figure 2. The difference from previous work is that instead of unbounded fanin AND and OR gates for the first and second levels, we allow only implementable gates, that is, gates which exist in the chosen library. The conditions derived in [8] are also correct in the presence of input inversions if the delay of the inverter does not exceed that of the remaining logic on the fastest feedback loop involving the inverter itself.
The set of events entering states of an excitation region ERj(a*) from outside the region is called the set of trigger events for event a*. In a circuit implementation of an SG, signals whose events are triggers for an event of a signal a will certainly be inputs (called trigger signals for a) to the logic circuit implementing a.
The quiescent region QRj(a*) of a given signal transition with excitation region ERj(a*) is a maximal set of states s reachable from ERj(a*) such that (1) a is stable in s and (2) 
• cj(a*) changes at most once within QRj(a*).
• cj(a*) does not cover any state from ERi(a*) ∪ QRi(a*), where i ≠ j.
Under these conditions, it is possible to show that the outputs of the first-level gates are one-hot encoded, which means that any valid Boolean decomposition of the second-level OR gates will be speed-independent.
The chosen architecture also covers the case in which a signal in the specification admits a combinational implementation (called a complete cover). In that case the set and reset networks are the complement of each other, and the C element with identical inputs can be simplified to a wire (see Figure 2,b,c).
Property-preserving event insertion
Our decomposition method is essentially behavioural: the extraction of new signals at the structural (logic) level must be matched by an insertion of their transitions at the behavioural (SG) level. Event insertion is an operation on the SG which selects a subset of states, splits each state in it into two states and creates, on the basis of these new states, an excitation and switching region for a new event. Figure 3 shows the chosen insertion scheme, analogous to that used by most authors in the area [14], in the three main cases of insertion with respect to the position of the states in the insertion set ER(x) (entrance to, exit from or inside ER(x)).
State signal insertion must also preserve the speed-independence of the original specification, which is required for the existence of a hazard-free asynchronous circuit implementation. Formally, we say that an insertion state set ER(x), in an SG A' obtained from a deterministic and commutative SG A by inserting event x, is a speed-independence preserving subset (SIP-set) iff (1) for each a ∈ E, if a is persistent in A, then it remains persistent in A', and (2) A' is deterministic and commutative. An efficient method of finding SIP-sets based on the notion of regions has been proposed in [6].
Assume that the set of states S in an SG is partitioned into two subsets which are to be encoded by means of an additional signal. This new signal can be added either in order to satisfy the CSC condition, or to break up a complex gate into a set of smaller gates. In the latter case, a new signal is added to represent the output of the intermediate gate added to the circuit.
Let r and r̄ = S - r denote the blocks of such a partition. In order to implement such an encoding, we need to insert appropriate transitions of the new signal in the border states between the two subsets.
In this paper we shall consider the so-called input border (IB) of a partition block r, denoted by IB(r), which informally is a subset of states of r which have predecessors not in r. We call IB(r) well-formed if there are no events leading from states in r - IB(r) to states in IB(r).
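The informal definitions of the input border and of well-formedness can be illustrated with a minimal Python sketch. The arc-list encoding of the SG, the function names `input_border` and `is_well_formed`, and the toy six-state graph are assumptions made for illustration only; they are not taken from the tool described in this paper.

```python
from typing import List, Set, Tuple

# Hypothetical SG encoding: each arc is (source state, event, target state).
Arc = Tuple[str, str, str]


def input_border(arcs: List[Arc], block: Set[str]) -> Set[str]:
    """States of `block` that have at least one predecessor outside `block`."""
    return {dst for src, _, dst in arcs if dst in block and src not in block}


def is_well_formed(arcs: List[Arc], block: Set[str]) -> bool:
    """True if no arc leads from block - IB(block) into IB(block)."""
    border = input_border(arcs, block)
    interior = block - border
    return not any(src in interior and dst in border for src, _, dst in arcs)


# Toy cyclic SG; state and event names are illustrative only.
arcs = [
    ("s0", "a+", "s1"),
    ("s1", "b+", "s2"),
    ("s2", "c+", "s3"),
    ("s3", "a-", "s4"),
    ("s4", "b-", "s5"),
    ("s5", "c-", "s0"),
]
r = {"s2", "s3", "s4"}
ib = input_border(arcs, r)      # only s2 is entered from outside r
ok = is_well_formed(arcs, r)

# An extra arc from the interior of r back into its border violates the condition.
arcs_bad = arcs + [("s3", "d+", "s2")]
bad = is_well_formed(arcs_bad, r)
```

In this toy graph the block r is entered only through s2, so IB(r) = {s2}; the additional arc from s3 to s2 makes the border ill-formed.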
Insertion of a new signal can be formalized with the notion of I-partition ([14] used a similar definition).
Given an SG with a set of states S, an I-partition is a partition of S into four blocks: S0, S1, S+ and S-. S0 (S1) defines the states in which x will have the value 0 (1). S+ (S-) defines ER(x+) (ER(x-)). For a consistent encoding of x, the only allowed events crossing boundaries of the blocks are the following:
The technology mapping method
As described in the previous section, any deterministic, commutative, output-persistent SG with the CSC property and satisfying the Monotonous Cover conditions can be implemented using the standard-C architecture.
This section describes how the potentially large abstract gates derived from the Monotonous Cover implementation can be decomposed into library gates while maintaining speed-independence. The potentially huge search space is limited by an efficient search algorithm that prunes decompositions that are guaranteed to violate speed-independence. The overall algorithm for signal insertion aimed at logic decomposition is sketched below. The next sections describe each step in more detail. The algorithm can be tuned by trading off efficiency against the quality of the results. For example, events other than a* can also be selected for decomposition in case no good divisor is found for a*.
Note that the algebraic divisors are only used for a preliminary choice of the function of the new signal to be added to the SG. The well-formedness conditions are then used to refine this function, so that it has a speed-independent implementation. The implementation of every signal in the circuit is recomputed at every step. In practice this implements Boolean division and can even obtain sequential decomposition. The conditions discussed below are used to guarantee progress at each step. We prune those divisors that would excessively increase the complexity of other signals due to the requirement to acknowledge every transition of the new signal to satisfy speed-independence.
We use the example hazard.g from the set of asynchronous benchmarks to illustrate our algorithm. Its SG is shown in Figure 1,a and an MC-implementation of the output signals c and d is presented in Figure 5,a. Our target is the decomposition of function Sz into two-input gates, because two-input gates are a standard worst case against which the performance of a decomposition algorithm can be measured. We assume here familiarity with multi-level logic
As traditionally done in multi-level combinational synthesis, we have chosen algebraic division as the main operation for logic decomposition. However, other divisors (e.g. Boolean divisors) might also be considered within this scheme. For each event a*, there may be several optimal functions that can implement a monotonous cover c(a*). Instead of finding decompositions for all valid covers, we choose only one of the minimum-cost covers and seek algebraic divisors for it, so as to avoid an explosion in the computational cost of the search.
Thus, for each cover c(a*) we seek algebraic divisors, aiming at decompositions of the following type:
c(a*) = f · g + r
where g is the quotient c(a*)/f. AND-decomposition is done when r = 0, whereas OR-decomposition occurs when g = 1.
To find good divisors f for c(a*), the following functions are considered:
• Kernels and co-kernels of c(a*).
• If c(a*) is a poly-term cover, any subset of terms of the sum-of-product expression (OR-decomposition).
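The decomposition c(a*) = f · g + r can be illustrated with a minimal Python sketch of the textbook weak (algebraic) division procedure. The cover representation (a set of cubes, each cube a frozenset of literal names) and the helper names `cover` and `algebraic_divide` are assumptions made for this illustration and do not reproduce the implementation used in this work.

```python
from typing import FrozenSet, Optional, Set, Tuple

Cube = FrozenSet[str]   # a product term, e.g. frozenset({"a", "b"}) for ab
Cover = Set[Cube]       # a sum of product terms


def cover(*cubes: str) -> Cover:
    """Build a cover from space-separated literal strings, e.g. cover("a b", "c")."""
    return {frozenset(c.split()) for c in cubes}


def algebraic_divide(c: Cover, f: Cover) -> Tuple[Cover, Cover]:
    """Weak division: return (g, r) such that c = f*g + r."""
    quotient: Optional[Cover] = None
    for d in f:
        # Quotient candidates: cubes of c that contain d, with d removed.
        partial = {cube - d for cube in c if d <= cube}
        quotient = partial if quotient is None else quotient & partial
    g = quotient or set()
    # Remainder: cubes of c that are not produced by the product f*g.
    product = {d | q for d in f for q in g}
    return g, c - product


# c = ab + ac + db + dc + e divided by the kernel f = a + d.
g, r = algebraic_divide(cover("a b", "a c", "d b", "d c", "e"), cover("a", "d"))

# AND-decomposition: the remainder is empty.
g_and, r_and = algebraic_divide(cover("a b", "a c", "d b", "d c"), cover("a", "d"))

# OR-decomposition: f is a subset of terms, so the quotient is the constant 1,
# represented here by the cover containing only the empty cube.
g_or, r_or = algebraic_divide(cover("a b", "c d", "e"), cover("a b", "c d"))
```

In the first call the quotient is b + c and the remainder is e; in the OR case the divisor is simply a subset of the terms of the cover, as in the second kind of candidate listed above.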
Speed-independent implementation
A Boolean function f defines a bipartition {S0, S1} of the set of states of the SG, where S0 (S1) contains the states in which f = 0 (f = 1). To insert a signal f^s that realizes function f, it is necessary to find two additional sets of [...]
Start from ER(f^s+) = IB(f+) [...] the ER(b*), thus removing f^s+ from the set of [...]
Calculation of ER(f^s+) stops either if at some step ER(f^s+) intersects with S0 (then no legal ER(f^s+) exists) or a fixed point is reached. It can be shown [7] that the above procedure always finds the well-
Cost estimation
We now estimate the complexity of the circuit obtained after decomposition with respect to that of the original circuit.
Complexity of the implementation of f^s
It can be shown, by analyzing the MC conditions ([q), that f^s = f is a correct complete cover for a signal f^s.
Events for which f^s is not a trigger
The preconditions for these events are not modified by the insertion of f^s, and hence we can use the same implementation as before the decomposition.
Events for which f^s is a trigger
Here we have two different cases. This property is used as a heuristic filter to select candidate divisors that are guaranteed not to increase excessively the complexity of the implementation of other signals.
Note that this conservative estimation is applied not only for Case 2 but also for Case 1 when the substitution of an old trigger signal by a new one fails.
Example hazard.g. In the decomposition using dc (Figure 1,d) signal f^s becomes a new trigger event to z- without replacing any other trigger event. Hence the cover for z- will increase by one literal, while the cover for z+ will decrease by one literal. Hence this decomposition is not useful.
In the decomposition using hc, event f^s- is inserted before c- and replaces trigger event a+. The function for c- will not increase in complexity. The result of the decomposition by function Ec is shown in Figure 5,b.
Experimental results
The strategy for general logic decomposition previously presented has been implemented and applied to a set of different benchmarks. Results are shown in Table 1 .
We have measured the complexity of each gate as the number of literals required to implement it as a sum-of-product gate, either complemented or not.
Thus a 2-input XOR gate (ab̄ + āb) is considered to be a 4-literal gate, whereas the function f = ab + ac + db + dc is also considered a 4-literal gate (f̄ = ād̄ + b̄c̄). This model is slightly different from the one used in [4], in which the complexity was measured as the number of different inputs for FPGA lookup tables.
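The literal-count model can be checked with a short Python sketch. It assumes that both the cover of a function and the cover of its complement are supplied by the caller (no minimization is performed), and the names `literals`, `are_complements` and `gate_cost` are hypothetical helpers introduced for this illustration.

```python
from itertools import product
from typing import Dict, List, Sequence

# A cube maps a variable to its required value (1 = plain, 0 = complemented).
Cube = Dict[str, int]
SOP = List[Cube]


def literals(sop: SOP) -> int:
    """Number of literals in a sum-of-products expression."""
    return sum(len(cube) for cube in sop)


def evaluate(sop: SOP, env: Dict[str, int]) -> bool:
    """Value of the sum-of-products expression under the assignment env."""
    return any(all(env[v] == val for v, val in cube.items()) for cube in sop)


def are_complements(p: SOP, q: SOP, variables: Sequence[str]) -> bool:
    """Exhaustively check that p and q differ on every input assignment."""
    for values in product((0, 1), repeat=len(variables)):
        env = dict(zip(variables, values))
        if evaluate(p, env) == evaluate(q, env):
            return False
    return True


def gate_cost(sop: SOP, complemented: SOP, variables: Sequence[str]) -> int:
    """Cost of a gate: literals of the cheaper of the two given polarities."""
    if not are_complements(sop, complemented, variables):
        raise ValueError("the two covers are not complements of each other")
    return min(literals(sop), literals(complemented))


xor = [{"a": 1, "b": 0}, {"a": 0, "b": 1}]
xnor = [{"a": 1, "b": 1}, {"a": 0, "b": 0}]
f = [{"a": 1, "b": 1}, {"a": 1, "c": 1}, {"d": 1, "b": 1}, {"d": 1, "c": 1}]
f_bar = [{"a": 0, "d": 0}, {"b": 0, "c": 0}]

xor_cost = gate_cost(xor, xnor, ["a", "b"])
f_cost = gate_cost(f, f_bar, ["a", "b", "c", "d"])
```

Both gates cost 4 literals under this model: the XOR needs 4 literals in either polarity, while f needs 8 literals uncomplemented but only 4 when implemented through its complement.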
The first set of columns in Table 1 indicates the complexity of the circuit before decomposition. The second set of columns reports the number of signals inserted for decomposition using gates with at most i literals (i = 2, 3, 4). The next column summarizes the results presented by Siegel [12] about the implementability of the circuit with only 2-input gates. All the implementations have been verified to be speed-independent. Of the 32 examples, only 6 were not implemented (n.i.) with 2-literal gates. Only one 6-input AND gate in pe-send-ifc and two 5-literal gates in tsend-bm were not decomposed when attempting to implement these circuits with 4-literal gates. Our results show a significant improvement over those presented in [12], and only one circuit (pe-rcv-if) could not be implemented with 2-literal gates from that benchmark suite. Global acknowledgement allows our method to effectively decompose complex gates with high fan-in (6 or 7 literals). This is shown by circuits like mri and vbe10b that were implemented with 2-literal gates. Figure 6 illustrates this fact, depicting the circuit vbe10b before and after logic decomposition into 2-literal gates.
The final columns present a rough estimation of the cost of speed-independence-preserving logic decomposition. The cost is evaluated as the number of [...] logic among different covers. In most cases extra cost is added for the preservation of speed-independence. However, considering that the area of a C element is roughly equivalent to a 3-input AND gate, we can conclude that the cost of preserving speed-independence is not higher than 10% of the area.
Conclusions and future work
In this paper we have shown a solution to the problem of multi-level logic synthesis and technology mapping for asynchronous speed-independent circuits. Let us summarise our two-step approach.
The first step (Section 3.1) chooses a candidate for decomposition: algebraic kernels, non-cube-free sub-covers, sub-cubes etc. Different versions are evaluated and the "best" one is taken. The "classical" combinational decomposition stops here.
The second step (Sections 3.2 and 3.3) performs the actual decomposition: it attempts to find a speed-independent implementation as similar as possible to the candidate obtained at step (1). This is based on a bi-partitioning using SIP-insertion corresponding to signal x obtained by the first step. Functions for all signals are derived from scratch. Our complexity arguments in Section 3.4 show that there is a good chance that x will get exactly the same function which was extracted at step (1). However, there is also a chance that this function will be smaller (thanks to Boolean decomposition). Multiple acknowledgements for x appear automatically at this function generation step. Functions for signals which were not decomposed at step (1) may also change. Hence, the chosen implementation for x may correspond to a very general sequential decomposition. Moreover, this is not a local but a "global" decomposition, since other signals may change as well.
Figure 6: vbe10b before and after logic decomposition into 2-literal gates.
The method is implemented in the tool petrify. The results of the last section show that, to the best of our knowledge, the method produces the most effective and efficient known decompositions of the standard set of asynchronous benchmarks (beating even the fundamental mode solutions). For example, circuits such as vbe10b and wrdatab have been decomposed for the first time into two-input AND gates by a software tool.
