This paper presents the theory and practical implementation of a method for multi-level logic synthesis of speed-independent circuits. An initial circuit implementation is assumed to satisfy the monotonous cover conditions but is technology independent. The proposed method performs both combinational (inserting new gates) and sequential (inserting new memory elements) decomposition of complex gates in a given standard cell library, while preserving the original behaviour and speed-independence. The algorithm applies known efficient algebraic factorization techniques from combinational multi-level logic synthesis, but also achieves Boolean simplification and sequential decomposition. The method allows sharing of decomposed logic.
Introduction
Speed-independent circuits, originating from D.E. Muller's work [11], are hazard-free under the unbounded gate delay model. With recent progress in developing efficient analysis and synthesis techniques, supported by CAD tools, this sub-class has moved closer to practice, bearing in mind the advantages of speed-independent designs, such as their greater temporal robustness and self-checking properties.
Existing methods of logic synthesis for speed-independent circuits either assume that the implementation library contains AND gates with unbounded fanin and "free" input inversions ([1, 5, 9]) or they use non-standard "hazard absorbing" flip-flops whose effectiveness in practice still needs to be evaluated ([14]). Other results on the implementability of semi-modular circuits without inputs using two-input/two-output AND and OR gates ([HI) are only interesting from a theoretical standpoint, due to their extremely high implementation cost.
In attempts to map speed-independent circuits into a more realistic, standard cell-like library, other sorts of restrictions have been imposed. For example, the approach described in [16] works only under the fundamental mode assumption, which is overly restrictive and does not fit well theoretically with the unbounded delay assumption. The same authors describe in [15] a method to perform technology mapping for speed-independent circuits that only decomposes existing gates (e.g., a 3-input AND into two 2-input ANDs), without any further search of the implementation space. They do not explore complex decompositions that could use multi-cube divisors or decompose several gates simultaneously. The same limitations also affect the work of [1, 2]. The idea of complete resynthesis of a circuit every time a new signal is inserted is exploited in [12] for the technology mapping of timed asynchronous circuits. However, the search space for decomposition is again limited to a single signal network.
In [13] a method for technology mapping of speed-independent circuits using complex gates was presented. This method, however, only identifies when a set of simple logic gates can be implemented as a complex gate, and cannot perform a speed-independent decomposition of a signal function that does not fit into a single gate. In fact, this method can be used as a post-optimization step after our proposed decomposition technique.
Finally, Burns [4] analyzes the correctness conditions for a decomposition of a sequential element that is part of a speed-independent circuit into two sequential elements (or a sequential and a combinational element). Notably, these conditions are analyzed using the original (unexpanded) behavioural model, thus helping the efficiency of the method. This work is, in our opinion, a big step in the right direction, but addresses mainly correctness issues. It does not describe how to use the efficient correctness checks in an optimization loop, and does not allow the sharing of a decomposed gate by different signal networks.
The idea of combinational logic decomposition with resynthesis has been proposed in [8, 7]. The approach combines efficient algebraic factorization techniques used in multi-level combinational logic synthesis (finding candidates for decomposition) with speed-independence preserving signal insertion (the latter idea originated in [17] and was implemented efficiently in [6]).
The main contribution of this paper is a generalisation and extension of the above basic idea so as to cover both combinational and sequential decomposition. We have developed a body of theory that allows us to prune the search space when looking for solutions. We continue to use classical logic synthesis techniques available for combinational multi-level logic in order to find good candidate functions for the decomposition. In the case of combinational decomposition the newly inserted signal is a library gate. The insertion of a combinational gate is based primarily on one of the two transitions of the gate's output (e.g., its rising transition). The other transition of the combinational gate is fully determined by the insertion place of the first transition.

0-8186-7922-0/97 $10.00 © 1997 IEEE
A sequential decomposition, based on a new memory element, can improve the progress of mapping by allowing the opposite transition to play a more effective role, since the set and reset logic are inserted independently. In particular, two Boolean functions can be decomposed at the same time with one new signal.
Theoretical background
In this section we introduce theoretical concepts required for our decomposition method: (1) circuit specification and its logic implementability; (2) conditions for speed-independent decomposition of complex gates; and (3) transformations of state graphs to ensure those conditions.
State Graphs and Logic Implementability
A State Graph (SG) is a labeled directed graph whose nodes are called states. Each arc of an SG is labeled with an event, that is a rising (a+) or falling (a-) transition of a signal a in the specified circuit. We also use the notation a* when the direction of the signal transition is not specified. Each state is labeled with a vector of signal values. An SG is consistent if its state labeling v : S → {0,1}^n is such that in every transition sequence from the initial state, rising and falling transitions alternate for each signal. The set of all signals whose transitions label SG arcs is partitioned into a (possibly empty) set of inputs, which come from the environment, and a set of outputs or state signals that must be implemented. In addition to consistency, the following two properties of an SG are needed for its implementability in a speed-independent logic circuit.
The first property is speed-independence. The second property, Complete State Coding (CSC), becomes necessary and sufficient for the existence of a logic circuit implementation. A consistent SG satisfies the CSC property if for every pair of states s, s' such that v(s) = v(s'), the set of output events enabled in both states is the same. (The SG in Figure 1,b is output-persistent and has CSC.) CSC does not, however, restrict the type of logic function implementing each signal. It requires that each signal is cast into a single atomic gate. The complexity of such a gate can, however, go beyond that provided in a concrete library or technology.
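The CSC property can be checked directly from its definition. The following is a minimal sketch, not the authors' implementation; the dictionary-based SG encoding, the event strings such as "a+", and the function name `has_csc` are assumptions made for illustration.

```python
from collections import defaultdict

def has_csc(codes, enabled, outputs):
    """Check Complete State Coding on a hypothetical SG encoding.

    codes:   dict mapping each state to its binary code, e.g. "010"
    enabled: dict mapping each state to the set of events enabled in it,
             written as signal name plus direction, e.g. {"a+", "b-"}
    outputs: set of output (non-input) signal names
    """
    def output_events(state):
        # Keep only events whose signal (event string minus "+"/"-") is an output.
        return frozenset(e for e in enabled.get(state, ()) if e[:-1] in outputs)

    # Group the enabled output-event sets by binary code.
    by_code = defaultdict(set)
    for state, code in codes.items():
        by_code[code].add(output_events(state))

    # CSC holds if states sharing a code enable the same output events.
    return all(len(event_sets) == 1 for event_sets in by_code.values())
```

States that share a code but differ only in enabled input events do not violate CSC in this sketch, which matches the definition above.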
Gate-level implementability without hazards
Necessary and sufficient conditions for speed-independent implementation using unbounded fanin AND gates (with unlimited input inversions), bounded fanin OR gates and C-elements were given in [1, 9]. In this work we consider a similar basic implementation architecture, called the standard-C architecture, which is described in Figure 2. The difference from previous work is that instead of unbounded fanin gates for the set and reset logic of C-elements, we allow only implementable gates, that is, gates which exist in the chosen library. The concepts of excitation and quiescent regions are essential for that. A set of states is called an excitation region (ER) for event a* (denoted by ERj(a*)) if it is a maximal connected set of states such that ∀s ∈ ERj(a*) : s --a*→ (i.e., a* is enabled in s). Since any event a* can have several separated ERs, an index j is used to distinguish between different connected occurrences of a* in the SG.
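As an illustration of the definition above, excitation regions can be obtained as connected components of the set of states in which a* is enabled. This is only a sketch under stated assumptions: states are comparable labels (strings), arcs are (source, target) pairs, connectivity is taken over arcs in either direction, and `excitation_regions` is a hypothetical helper name.

```python
def excitation_regions(enabled_states, arcs):
    """Split the states that enable a* into maximal connected sets.

    enabled_states: set of states in which a* is enabled
    arcs:           iterable of (source, target) state pairs of the SG
    """
    # Adjacency restricted to enabling states, ignoring arc direction.
    neighbours = {s: set() for s in enabled_states}
    for src, dst in arcs:
        if src in neighbours and dst in neighbours:
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    regions, seen = [], set()
    for start in sorted(enabled_states):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], set()
        while stack:
            state = stack.pop()
            if state in component:
                continue
            component.add(state)
            stack.extend(neighbours[state] - component)
        seen |= component
        regions.append(component)
    return regions
```

Each returned set plays the role of one ERj(a*); the index j corresponds to the position in the returned list.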
The quiescent region (QR) (denoted by QRj(a*)) of a transition a*, with excitation region ERj(a*), is a maximal set of states s reachable from ERj(a*) such that a is stable in s and s is not reachable from any other ERk(a*) such that k ≠ j without going through ERj(a*)¹. Let Cj(a*) denote one of the first-level AND-OR gates in the standard-C architecture. Cj(a*) is a correct monotonous poly-term cover² for the excitation region ERj(a*) if the following three conditions are satisfied:
1. Cover condition: Cj(a*) covers all states of ERj(a*) (i.e., Cj(a*) evaluates to 1 in all states of ERj(a*)).
2. One-hot condition: Cj(a*) does not cover any state outside ERj(a*) ∪ QRj(a*).
3. Monotonicity condition: Cj(a*) changes at most once along any state sequence within QRj(a*).
The conditions above are called the Monotonous Cover conditions or, for short, the MC-conditions. Since under these conditions the outputs of the first-level gates are one-hot encoded, any valid Boolean decomposition of the second-level OR gates is speed-independent.
¹ Note that contrary to [9, 1] in this paper we use only the so-called restricted quiescent regions which do not include states reachable directly from two different excitation regions of the same signal.
² Here for simplicity we consider the definition of Monotonous Cover without the extension by the so-called backward quiescent regions and without considering covering of multiple regions by the same cover. However, all the results can be easily generalized for this extension as well.
The standard-C architecture permits a combinational implementation of a signal. If the set and reset networks are the complements of each other, then a C-element with identical inputs can be simplified to a wire (see Figure 2,b,c). In such a case we say that the signal has a complete cover.
Property-preserving event insertion
Our decomposition method is essentially behavioural: the extraction of new signals at the structural (logic) level must be matched by an insertion of their transitions at the behavioural (SG) level. Event insertion is an operation on an SG which selects a subset of states, splits each of them into two states and creates, on the basis of these new states, an excitation region for a new event. Figure 3 shows the chosen insertion scheme, analogous to that used by most authors in the area [17]. State signal insertion must preserve the speed-independence of the original specification. An inserted signal is denoted by x in this paper. The corresponding events are denoted x*, x+, x-, or, if no confusion occurs, simply x. Let A be an SG and A' be the SG obtained by insertion of event x. We say that an insertion state set ER(x) in an SG A is a speed-independence preserving set (SIP-set) iff: (1) for each event a in A, if a is persistent in A, then it remains persistent in A', and (2) A' is deterministic and commutative. The formal conditions for the set of states r to be a SIP-set can be given in terms of intersections of r with the so-called state diamonds of the SG [6]. These conditions are illustrated by Figure 4, where all possible cases of illegal intersections of r with state diamonds are shown.
It was shown in [6] that the insertion of a signal by means of a SIP-set is a necessary and sufficient condition to preserve the speed-independence of the corresponding SG. This requirement is the most general one in the synthesis of speed-independent circuits and it does not restrict the solution space unless we go beyond the speed-independent class. An efficient method for finding SIP-sets, which is based on regions, has been proposed in [6]. The first method for finding SIP-sets, based on reduction to a satisfiability problem, was proposed in [17]. Assume that the set of states S in an SG is partitioned into two subsets which are to be encoded by means of an additional signal. This new signal can be added either in order to satisfy the CSC condition, or to break up a complex gate into a set of smaller gates. In the latter case, a new signal represents the output of the intermediate gate added to the circuit. Let r and r̄ = S − r denote the blocks of such a partition. For implementing such a partition we [...] In this paper we shall consider the so-called input border of a partition block r, denoted by IB(r), which is informally a subset of states of r by which r is entered. We call IB(r) well-formed if there are no arcs leading from states in r − IB(r) to states in IB(r). If a new signal is inserted using an input border which is not well-formed, then the consistency property is violated. Therefore, if an input border is not well-formed, its well-formed speed-independence-preserving closure is constructed, as described by Algorithm 4.1 in Section 4.
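The input border and its well-formedness test can be sketched as follows. This is an illustrative reading of the informal definition above, assuming that a state belongs to IB(r) when it has at least one predecessor outside r; the function names and the (source, target) arc encoding are hypothetical, and the SIP closure of Algorithm 4.1 is not reproduced here.

```python
def input_border(block, arcs):
    """States of `block` that are entered by an arc from outside the block."""
    return {dst for src, dst in arcs if dst in block and src not in block}

def is_well_formed(block, arcs):
    """True if no arc leads from block - IB(block) back into IB(block)."""
    border = input_border(block, arcs)
    interior = block - border
    return not any(src in interior and dst in border for src, dst in arcs)
```

For a block {1, 2, 3} entered only through state 1, the border is {1}; adding an arc from state 2 back to state 1 makes the border ill-formed, which is the situation in which a closure would have to be constructed.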
The insertion of a new signal can be formalized with the notion of I-partition ([17] used a similar definition), in which the arcs crossing block boundaries follow the cycle S0 → S+ → S1 → S− → S0.
Decomposition techniques
We assume here familiarity with multi-level logic synthesis (see [3] for more details).
As described in the previous section, any deterministic, commutative, output-persistent SG satisfying the CSC and the Monotonous Cover conditions can be implemented using the standard-C architecture. We assume that C-elements are present in the library³. OR-gates combining cover functions C(a*) can be decomposed by any standard technique since their inputs are one-hot encoded. Hence the bottleneck for technology mapping is the implementation of cover functions C(a*) using gates available in the library.
As traditionally done in multi-level combinational synthesis, we have chosen algebraic division as the main operation for logic decomposition. Thus, for each cover function C(a*) we seek algebraic divisors, aiming at decompositions of the following type: C(a*) = F * G + R, where G is the quotient C(a*)/F and R is the remainder.

In our decomposition technique transitions of x are acknowledged by several cover functions. This is more general and powerful than [15, 4], where transitions of x must be acknowledged locally, only by the cover function C(a*) from which x is extracted. Multiple acknowledgment offers two advantages: (1) the same signal x can be shared by several cover functions (this corresponds to the extraction of common sub-dividers in classical multi-level decomposition) and (2) a correct speed-independent decomposition can be found even if it does not exist for solutions with single acknowledgments (see the experimental results). Note that we do not specifically search for multiple acknowledgments. They appear automatically due to the signal insertion technique based on SIP-sets. Hence our solution is correct by construction and, contrary to [2], never requires iterations with verification procedures.

³ In fact our technique works and is implemented also for RS- and D-latches. However, this generalization of the method is omitted due to lack of space.
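The decomposition C(a*) = F * G + R can be illustrated with the textbook weak (algebraic) division procedure on sum-of-products covers. This is a generic sketch, not the authors' tool: covers are modelled as sets of cubes, cubes as frozensets of literal strings (a trailing apostrophe, as in "a'", is an assumed notation for a complemented literal), and `cube` and `algebraic_divide` are hypothetical names.

```python
def cube(*literals):
    """Build a cube (product term) from literal names such as "a" or "a'"."""
    return frozenset(literals)

def algebraic_divide(cover, divisor):
    """Weak division: return (quotient G, remainder R) with cover = F*G + R.

    cover, divisor: sets of cubes, each cube a frozenset of literals.
    """
    quotient = None
    for d in divisor:
        # Cubes of the cover that contain d, with d's literals removed.
        candidates = {c - d for c in cover if d <= c}
        quotient = candidates if quotient is None else quotient & candidates
    quotient = quotient or set()

    # Everything not reproduced by divisor * quotient is the remainder.
    product = {d | q for d in divisor for q in quotient}
    remainder = cover - product
    return quotient, remainder
```

For example, dividing ab + ac + def by F = b + c gives G = a and R = def, and dividing the single cube a'dc by F = a'c gives G = d with an empty remainder, which is the AND-decomposition shape used for single-cube covers.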
To find good divisors F for C(a*) the following functions are considered:
• If C(a*) is a poly-term cover, any subset of terms of the sum-of-product expression (OR-decomposition).
• If C(a*) is one cube, any subset of literals of the cube (AND-decomposition).
• Recursive decomposition of the previous candidates, e.g. sub-kernels and AND/OR-decomposition of kernels.
This generation of divisors is heuristically pruned to avoid an explosion of candidates for functions with many terms or cubes with many literals. Experimental results (Section 6) have shown this type of decomposition to be very effective. In particular, only those decompositions are considered that: (1) preserve speed-independence and (2) guarantee progress in mapping the circuit to the given library.
The first condition is satisfied by finding an I-partition for signal x. Many candidates for decomposition are filtered out at this step, since for many divisors there are no valid I-partitions.
To clarify the second condition, assume that function F is extracted from a cover function C(a*) for combinational decomposition (see Figure 5,b). If there is a valid I-partition for a new signal x, then there is a speed-independent implementation for the circuit with signal x. However, in general, there is no guarantee that function C(a*) is simplified in the new circuit. The substitution of x for F in C(a*) does not always preserve speed-independence and hence new fan-in signals for C(a*) can appear in the implementation. Thus, the progress condition checks whether a substitution of x for F in C(a*) is valid.
Since multiple acknowledgment of x can appear, the requirement for a "good decomposition" is as follows: the complexity of all functions in x's fan-out other than C(a*) has to remain the same or to increase only moderately. In Section 4.3 we present a computationally efficient method for the estimation of effective decompositions.
The overall algorithm for logic decomposition is sketched below. The next sections describe each step in more detail.
Algorithm 3.1 (Speed-independent decomposition)
while circuit is not mapped to the library do
    Calculate monotonous covers for all events;
    Let a* be the event with the most complex cover;
    Let D, be a set of divisors for C(a*); /* Kernels, co-kernels, AND/OR decomposition */
    Let & be a set of divisors for the most complex cover functions other than C(a*);
    [...]

[...] Figure 6,a. Our target is the decomposition of function S_z into two-input gates, because it is a standard worst case against which the performance of a decomposition algorithm can be measured. Function S_z consists of a single 3-literal cube ādc. It can be decomposed in three ways: by extracting functions ād, āc and dc.
Example 2. For the cover C(y*) = ab+ac+de f the fol- 
Combinational decomposition
State partitioning
In this section we apply the theory of SIP-insertion, reviewed in Section 2.3, to a divisor F of a given cover C(a*). We also say that signal b is a trigger signal for signal a and for event a*. All trigger signals for signal a must be included in the support of the logic function implementing a and hence each trigger signal will be in the fan-in of a. Triggers can be easily derived by observing ERs of a in the SG.
We can also show another property of trigger signals, which will be used to estimate the complexity of the logic after decomposition. The following property states that, if there is a well-formed SIP closure of the IB, then there is a minimal closure that has strictly fewer states than any other. Example hazard.g continued.
In the example (see Figure 8,a) ER(x-) = {1011} (step 1). It is well-formed (step 2). At step 3 we will find that state 0011 ∈ after(a-, ER(x-)) and state 1001 ∈ [...]. Therefore, {0011, 1001} are included [...] (Figure 8,b). A state diamond illegally intersects ER(x-) (step 4). To legalize this, the intersection state 0001 is included in ER(x-) as shown in Figure 8,c.
Progress Analysis
If ER(x+) and ER(x-) are derived, then there is a speed-independent implementation of the SG with a new signal x. However, to ensure progress in the technology mapping for the target cover function C(a*) = F * G + R, we would like to have the following implementation in the new circuit: C(a*) = x * G + R (function F is substituted in this expression by one literal x). This is not always possible, since to preserve speed-independence, C(a*) may require more fan-in signals. We will formulate progress conditions which define when the implementation above is valid. This case is illustrated in Figure 10. In SG A state s ∈ ER(a-). However, in one of its images, s', signal a is equal to 1 and is stable, and therefore s' ∈ QR(a+)_A'. Hence, state s is in the inverse image of QR(a+)_A'. The following procedure computes the inverse image for a quiescent region (by example of QR(a_i+)_A'). Before formulating progress conditions we present a useful property that captures conditions for signal x to have a constant value inside the excitation region of the original signal a in the new SG even if the excitation region for x* in the original SG contains states from ER(a*) (x, as before, denotes the signal which is inserted for decomposition).
A symmetrical property holds for ER(x-).
The next proposition states the progress condition by presenting conditions for preserving monotonous cover conditions for substituting function F with one literal x in the cover function C(a*).
Proposition 4.1 [7]
Let C_A(a*) = F * G + R be a monotonous cover of ER(a*) in SG A. Let ER(x+) and ER(x-) be the S+ and S- sets for inserting a signal x obtained by Algorithm 4.1. The function C_A'(a*) = x * G + R satisfies the three conditions for the monotonous cover in the new SG A', iff: [...]
Cover condition: after(a*, (ER(a*) ∩ F * G * R̄)) ∩ ER(x+) = ∅
[...] s1 has two images s1' and s1'' in A' such that s1' --x+→ s1'', which implies that signal x has value 0 in s1' and value 1 in s1''. Therefore, state s1' ∈ ER_A'(x+) is not covered by C_A'(a*) = x * G + R since both x * G and R have value 0 in s1'. The cover condition is violated.
Condition 2 ensures the one-hot condition for C_A'(a*) in the new SG A'. Let s be outside ER(a*) ∪ QR⁻¹(a*).
If s ∈ ER(x-) ∩ G in SG A, then in the new SG A', function x * G evaluates to 1 in the first image s' of s (s' --x-→ s''). Hence, for A', function x * G evaluates to 1 outside ER_A'(a*) ∪ QR_A'(a*), which violates the one-hot condition for the cover function C_A'(a*) = x * G + R.
Condition 3 ensures the monotonicity condition for C_A'(a*) in SG A'. Condition 3(a) guarantees that C_A'(a*) cannot make a non-monotonous transition of the type "1-0-1" along any path inside ER_A'(a*) ∪ QR_A'(a*). Set QR(a*) ∩ F * G * R̄ contains the states of QR(a*) that are covered by F * G, but not by R, in SG A. Let [...] Neither image is covered by R. Moreover, since states of ER(a*) are covered by C_A'(a*), the cover function C_A'(a*) performs a non-monotonic transition 1-0-1 along a path within ER_A'(a*) ∪ QR_A'(a*) (this path starts in ER(a*) and contains states s' and s'').
Condition 3(b) ensures that C_A'(a*) cannot make a non-monotonous transition of the other type "0-1-0" along any path inside ER_A'(a*) ∪ QR_A'(a*). Assume that there is at least one state, s, such that s ∈ QR⁻¹(a*) ∩ ER(x-) ∩ G and let its predecessor, s1, be covered neither by G nor by R. Then function C_A'(a*) has value 0 in the image, s1', of s1 (if s1 has two images, then C_A'(a*) has value 0 in both).
State s has two images in A' (s' --x-→ s''). Function x * G evaluates to 1 in the first one, s', and to 0 in the second one, s''. Hence, function C_A'(a*) performs a non-monotonous 0-1-0 transition along the path s1' → s' → s'' in A'.
Example hazard.g continued. All the conditions of Proposition 4.1 are satisfied for F = āc and F = dc, and for both of them S_z can be safely decomposed into two AND gates.
Cost estimation
The progress condition (if satisfied) guarantees that the implementation of a target cover function C(a*) will be simplified as a result of a decomposition. However, to accept a decomposition we need to check that it will not increase the complexity of logic for other events. We use a conservative estimate of logic complexity, in which trigger signals play a key role, in order to select candidates for decomposition.
All events (besides the target event a*) can be divided into 3 groups:
• Events x* of signal x. It can be shown, by analyzing the MC conditions, that x = F is a correct complete cover for signal x.
• Events for which x* is not a trigger. The preconditions for these events are not modified by the insertion of x, and hence we can (in the worst case) use the same implementation as before the decomposition. It is possible, though, that x can be used to further simplify the implementation of those signals as well, since the don't care set is increased.
• Events for which x* is a trigger, denoted by TT(x). For estimating the complexity of such events the following procedure is used. [...] This property is used as a heuristic filter to select candidate divisors that are guaranteed not to increase excessively the complexity of the implementation of other signals.
Example hazard.g continued. For a decomposition with F = dc (Figure 9,c) signal x becomes a new trigger for z- without replacing any other trigger. Hence the cover for z- will increase by one literal. A cover for z+ will decrease by one literal. This decomposition is not effective. If F = āc is used, then event x- is inserted before c- and replaces trigger event a+. The function for c- will not increase in complexity. The result of decomposition using function āc is shown in Figure 6,b.
5 Sequential decomposition
Motivation
Combinational decomposition is limited, since signal insertion using bipartition {F, F̄} is based primarily on one of the two transitions of the gate's output (e.g., its rising transition). The other transition of the combinational gate is fully determined by the insertion place of the first transition. Moreover, if x substitutes F in cover function C(a*) = F * G + R, then in most cases, event x+ becomes a trigger to a* and is acknowledged by a* itself.
However, x- is often acknowledged by signals different from a, which may increase their complexity. Sequential decomposition, based on a new memory element, can improve the progress of mapping by allowing the opposite transition to play a more effective role, since the set and reset logic are inserted independently. In particular, two Boolean functions can be decomposed at the same time with one new signal.
Assume that there are two functions C(a*) = F * G + R and C(b*) = P * Q + T, which are not yet mapped in the library, and such that F * P = 0. Then, in the sequential decomposition, a new signal x is inserted in such a way that x will go to 1 when F is changing from 0 to 1, and go to 0 when P is changing from 0 to 1. Then, both rising and falling transitions of x can be used to simplify the cover functions: x+ to simplify C(a*), and x- to simplify C(b*).
To illustrate that sequential decomposition can be more powerful than combinational decomposition, let us modify the hazard.g example by declaring signal c to be an input (example hazard-mod.g). As before, we would like to map the three-literal function S_z = ādc into two-input gates. Since signal c is now an input, there are additional constraints for preserving the input/output interface. Indeed, we can no longer make the new event x- a trigger for input c-. The derivation of ER(x+) and ER(x-) for the example hazard-mod.g and the combinational decomposition based on F = āc (which succeeded in hazard.g) is shown in Figure 11,a. ER(x-) for hazard-mod.g includes 5 states (instead of 1 for hazard.g, cf. Figure 9,a) and event x- no longer replaces any other trigger event of z-. Decomposition based on F = āc makes the cover function for event z- even worse: 4 literals instead of 3 (hence it is not useful). Example hazard-mod.g cannot be mapped using only combinational decomposition. In what follows we will refer to hazard-mod.g to illustrate the steps of sequential decomposition.
State partitioning
Combinational decomposition using function F is based on bipartition of states into two blocks S_F = {s : [...]

[Algorithm 5.1, steps 1 to 6]

Note that the set D1 (and similarly D0) is constructed in two steps, 1(a.i) and 1(a.ii), by first applying the backward, and then the forward reachability. If SG A is cyclic (such that the initial state s0 is reachable from any other state of the SG), then step 1(a.ii) can be omitted, since it does not produce any new states in D1. However, if s0 is not a cyclic state for SG A and s0 ∈ SFF, then both traversals are needed to identify which set, D1 or D0, state s0 (and its successors) belongs to.
Algorithm 5.1 ensures consistency for the new signal x .
Step 2 checks that any path in SG A starting from IB(F) cannot reach ER(x+) without crossing ER(x-) (or symmetrically a path from IB(P) cannot reach ER(x-) without crossing ER(x+)).
Step 3 checks that D1 and D0 have no states in common. These two checks guarantee that there are no cycles inside D1 ∪ D0. Therefore, signal x can only perform consistent transitions in A': 1* → 0 → 0* → 1 → 1* → ....
The following property, whose proof can be found in [SI, shows that Algorithm 5.1 is sound.
Property 5.1 Let I = {S+ = ER(x+), S1, S-
Progress conditions
Sequential decomposition is aimed at mapping into the library two non-implementable cover functions C(a*) and C(b*). However, the decomposition can be useful if at least one of the functions is simplified. We can accept a sequential decomposition simplifying only one cover function (e.g., C(a*) = F * G + R) in two main cases:
1. Combinational decomposition using F for function C(a*) failed because of the x- event, e.g., the SIP conditions are violated for ER(x-).
2. Combinational decomposition using F for function C(a*) is valid, but it makes the logic for some events other than a* more complex (e.g., due to the acknowledging event x-).
However, if neither of the functions C(a*) or C(b*) is simplified by sequential decomposition, then it is rejected. Estimating progress for sequential decomposition is very similar to that for a combinational one.
• Events for which x is not a trigger. Similar to the combinational case, it can be shown that the complexity of these events cannot increase with the insertion of x.
• Events for which x is a trigger. We either check that x can substitute for some other trigger signals in these cover functions (see Proposition 4.2) or (if this check fails) that cover functions can be implemented with at most one extra literal x (see Property 4.5).
Example hazard-mod.g continued. For the target event z+, the sequential decomposition with F = āc and P = d satisfies the progress condition. It also does not disturb the implementability of event z-. Thus, the sequential decomposition is successful, while all combinational decompositions fail. The final implementation is shown in Figure 11,c.
Experimental results
The strategy for general logic decomposition presented above has been implemented and applied to a set of benchmarks. Results are shown in Table 1.
We have measured the complexity of each gate as the number of literals required to implement it as a sum-of-product gate, either complemented or not. Thus a 2-input EXOR gate (ab̄ + āb) is considered to be a 4-literal gate, whereas the function ab + ac + db + dc is also considered a 4-literal gate (the complement of ād̄ + b̄c̄). This model is different from the one used in [4] where technology mapping was targeted at [...]

For the STG of Figure 13 output signals z and y are implemented by 3-input AND gates. Our tool finds their decomposition into 2-input AND gates, in which both outputs z and y are used to acknowledge switchings of a new signal x. No valid decomposition (preserving speed-independence) exists when x is acknowledged by only one output (either y or z). The method from [4] looks for the decomposition within a single signal network and hence will fail to decompose the 3-input AND gates.
The first set of columns in Table 1 indicates the complexity of the circuit before decomposition. The second set of columns reports the number of signals inserted for decomposition using gates with at most i literals (i = 2, 3, 4), and the CPU time required to find the solution (in seconds, on a SPARCstation 20). The number of inserted signals also shows the number of iterations in technology mapping: the circuit is resynthesized every time a new signal is inserted. The next column summarizes the results presented by Siegel [15] with only 2-input gates. All realizations have been verified to be speed-independent.
Of the 32 examples, only 5 were not implemented (ni) with 2-literal gates. Only one 5-input AND gate in pe-send-ifc and two 5-literal gates in tsend-bm were not decomposed when attempting to implement these circuits with 4-literal gates. We significantly improve over the results presented in [15], and only one circuit (pe-rcv-ifc) from that benchmark suite could not be realized with 2-literal gates.
Global acknowledgment allows the method to effectively decompose complex gates with high fan-in (6 or 7 literals). This is shown by circuits like mr1 and vbe10b, which were implemented with 2-literal gates. Figure 14 illustrates this fact, depicting the circuit mr1 before and after logic decomposition into 2-literal gates.
The effectiveness of sequential decomposition is illustrated in Figure 15. The insertion of a new latch was crucial to allow the decomposition of a gate that could not be decomposed by the approach presented in [15].
The final columns present a rough estimate of the cost of speed-independence-preserving logic decomposition. The cost is evaluated as the number of literals of the combinational gates and the number of C elements of the circuit. The column "non-SI" reports the cost of decomposing the original implementation of the circuit into 2-literal gates without preserving speed-independence (tech_decomp -a 2 command in SIS). The column "SI" reports the cost of the decomposition preserving speed-independence. In some cases, such as vbe6a, the number of literals is reduced because the decomposition strategy allows logic to be shared among different covers. In most cases extra cost is added to preserve speed-independence. However, if we consider that the area of a C element is roughly equivalent to that of a 3-input AND gate, we can conclude that the area cost of preserving speed-independence is not higher than 5%.
The last column shows, for the sake of comparison, the cost of performing technology mapping against a 2-input library (which is roughly the same as a 2-literal library) using the bounded wire delay model, after delay padding ([10]). If we consider the cost of a C element to be 3 literals, the total cost of the speed-independent implementations in the 2-literal library is 640 + 109 x 3 = 967 literals, which is considerably smaller (and the circuits are probably faster, because there is no need to add delay buffers).
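The total quoted above follows directly from the stated assumption that a C element costs 3 literals. A minimal sketch of the arithmetic, where the function name `total_cost` is only illustrative:

```python
# Assumption taken from the text: one C element is costed as 3 literals.
C_ELEMENT_LITERALS = 3

def total_cost(gate_literals, c_elements, c_cost=C_ELEMENT_LITERALS):
    """Literals of the combinational gates plus the literal-equivalent of the C elements."""
    return gate_literals + c_elements * c_cost

# 640 literals of combinational logic and 109 C elements, as reported above.
si_total = total_cost(640, 109)
```

The same helper can be reused with a different `c_cost` to see how sensitive the comparison is to the assumed size of a C element.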
Conclusions and future work
In this paper we have shown a solution to the problem of multi-level logic synthesis and technology mapping for asynchronous speed-independent circuits. The method is based on both combinational and sequential decomposition, for each of which we apply a two-step approach.
The first step (Section 3) chooses a candidate for decomposition: algebraic kernels, non-cube-free sub-SOPs, sub-cubes, etc. Different versions are evaluated and the "best" is taken; say it corresponds to the new signal z. Combinational decomposition for synchronous circuits stops here. In the case of sequential decomposition two candidates are considered simultaneously.

The second step (Sections 4 and 5) performs the actual decomposition: it attempts to find an optimized speed-independent implementation based on the candidate obtained in the first step. This is based on partitioning the state space into four sets, according to whether signal z is stable or changing, while ensuring the speed-independence of the expanded specification (a necessary condition for speed-independent implementability). A new implementation is then derived for each signal, thus achieving global optimization and acknowledgment. The complexity arguments in Section 4.3 show that there is a good chance that z will get exactly the same function that was extracted in the first step. However, there is also a chance that this function will be smaller (thanks to boolean decomposition). Multiple acknowledgments for z appear automatically at this function generation step. Functions for signals that were not decomposed in the first step may also change. Whenever a combinational decomposition fails to reduce the overall complexity (due to the lack of control over the insertion of the opposite transition z-), the procedure applies a sequential decomposition (where z- is used to simplify one more cover). As a result, the actual function for z may correspond to a very general sequential decomposition. Moreover, this is not a local but a "global" decomposition, since other signals may change as well.
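To make the first step more concrete, the sketch below shows standard weak (algebraic) division, the textbook operation for extracting a candidate divisor from a cover. It is only an illustration under stated assumptions: it is not the code of the tool, it covers only candidate extraction, and it says nothing about the speed-independence checks of the second step. The helpers `cube`, `weak_divide` and `substitute` are hypothetical names.

```python
def cube(text):
    """Build a cube (product of literals) from a space-separated string."""
    return frozenset(text.split())

def weak_divide(f, d):
    """Weak algebraic division: return (quotient, remainder) with f = quotient*d + remainder.

    f and d are sets of cubes; each cube is a frozenset of literal names.
    """
    if not d:
        raise ValueError("divisor must contain at least one cube")
    quotient = None
    for d_cube in d:
        # Cubes of f that contain d_cube, with d_cube's literals removed.
        partial = {f_cube - d_cube for f_cube in f if d_cube <= f_cube}
        quotient = partial if quotient is None else quotient & partial
        if not quotient:
            return set(), set(f)
    covered = {q | d_cube for q in quotient for d_cube in d}
    return quotient, set(f) - covered

def substitute(f, d, name):
    """Rewrite f using a new signal `name` that implements the divisor d."""
    quotient, remainder = weak_divide(f, d)
    return {q | {name} for q in quotient} | remainder

# f = ac + ad + bc + bd + e, candidate divisor z = a + b
f = {cube("a c"), cube("a d"), cube("b c"), cube("b d"), cube("e")}
d = {cube("a"), cube("b")}

quotient, remainder = weak_divide(f, d)   # quotient c + d, remainder e
new_f = substitute(f, d, "z")             # z c + z d + e
```

In this example the cover of f shrinks from 9 to 5 literals at the price of a 2-literal gate for z. Whether such a candidate can actually be inserted while preserving speed-independence is exactly what the second step has to decide.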
The method is implemented in the tool petrify. The results shown in the last section indicate that, to the best of our knowledge, the method is the most effective and efficient amongst those available to date for the standard set of asynchronous benchmarks. For example, this is the first time that such examples as vbe10 and wrdatab have been decomposed into two-input AND gates by a software tool.
We are currently working on improving the method to make it complete (i.e., covering the largest class of State Graphs that can be implemented in a given library) and on extending the basic implementation architecture to other types of sequential elements, such as S/R flip-flops or D latches.
