Logic decomposition is a well-known problem in logic synthesis
I. INTRODUCTION
Asynchronous-circuit design has traditionally been considered as a sort of "black magic" that could be tackled only by hand and at great cost. On the other hand, asynchronous circuits offer a few advantages over their synchronous counterparts that explain a recent revival of interest in asynchronous design techniques. The ultimate goal of asynchronous computer-aided design (CAD) research is to create a design flow that is as easy to use by designers as the standard synchronous synthesis-based flow.
Unfortunately the separation between function and timing, which is the key to the success of synchronous design techniques, is much more problematic in the asynchronous case. Hazards, i.e., unexpected transitions on gate outputs, are not filtered out by letting the combinational logic stabilize before clocking registers, but they must be avoided by a careful design of the logic and of its timing.
Three main classes of asynchronous circuits avoid hazards purely by simple timing assumptions and by logical means, thus preserving as much as possible the above mentioned separation.
1) Fundamental mode circuits [1], [2], [3] assume that the environment of a circuit is so slow that the logic has time to stabilize before inputs can change again. Intuitively, a fundamental mode circuit behaves similarly to a synchronous one with a clock rate defined by the arrival of input patterns. Therefore, the problem of avoiding hazards has a much simpler solution that allows one to apply conventional design methods known from the synchronous world. In particular, the decomposition of gates using algebraic techniques does not lead to any circuit malfunctions as long as the fundamental mode assumption remains valid [4]. However, despite being simple and convenient, this assumption reduces the amount of concurrent activity that can take place in a circuit with potentially loosely coupled interfaces.
2) Delay-insensitive circuits [5] make no assumptions on the delays of logic blocks and wires. Such circuits are often synthesized by syntax-directed methods, using nonstandard libraries of relatively large control blocks [6], [7], followed by limited peephole optimizations [8].
3) Speed-independent circuits [9], which can be built by using conventional logic gates, assume that the skew in the delays of fanout branches is smaller than the delays of the logic gates.
For speed-independent circuits, recent research has developed a variety of logic synthesis techniques that allow one to trade off synthesis speed (as in the case of the fast heuristics developed by [10], [11], and [12]) and optimality (as in the case of the more powerful and expensive techniques developed by [13] and [14]). Even though the underlying assumption may seem at the same time pessimistic (about gates) and optimistic (about wires), recent results [15] suggest that delay-aware postoptimizations may further improve the quality of the synthesized results. Moreover, delay tuning [16] and low-skew routing [17] may help satisfy the hypothesis about the low wire skew.
The main problem of logic synthesis for speed-independent circuits is that existing methods assume nonstandard implementation libraries, such as arbitrarily complex Boolean gates [18] or arbitrary fanin AND gates [13], [14]. Standard logic decomposition followed by technology mapping is not applicable here, because arbitrary decomposition of a large gate may introduce hazards [1], [4]. This paper is aimed exactly at solving that problem by defining speed-independence-preserving decomposition of large logic gates into smaller ones, carried down to two-input NAND or NOR gates, on which standard technology mapping techniques operate [19]. The proposed decomposition method guarantees that every transition of a new signal resulting from decomposition is acknowledged by some other signal in the circuit, in order to avoid hazards. This is achieved in two major steps: 1) finding a logic decomposition of the complex gate circuit based on algebraic factorization originally proposed for combinational logic [19]; 2) inserting new signals while preserving overall hazard freedom, based on an idea proposed in [20] and on the efficient implementation techniques described in [21].
Both these parts are provided with an appropriate progress condition check and cost function evaluation, so as to achieve global optimization. The method has been embedded into the overall synthesis procedure implemented in the software tool "petrify" [22]. This publicly available tool can perform various Petri net manipulations [23], as well as solve the state encoding problem [21] and derive a speed-independent technology-mapped implementation of asynchronous circuits. It contains about 50 000 lines of C code and runs on different UNIX platforms and MS-Windows. Most of the implemented algorithms use symbolic BDD-based representations of the state space [24]. Currently, "petrify" is being used by different universities and industries for their research in asynchronous circuits. A few projects have also been developed with the assistance of "petrify" for the design of control logic. Among them, it is worthwhile to mention the AMULET asynchronous implementation of the ARM microprocessor.
A. Comparison With Previous Work
The approaches described in [25] and [26] work only under the fundamental mode assumption, which is often too restrictive as discussed above.
The method to perform technology mapping for speed-independent circuits described in [27] decomposes existing gates (e.g., a three-input AND into two two-input AND's), without any further search of the implementation space. It does not explore complex decompositions, which could use multicube divisors, or decompose several gates simultaneously.
The work of [28] treated the decomposition problem for speed-independent circuits in a manner that is similar to the method presented in this paper, by inserting new signals that implement subfunctions of complex gates. However, explicit insertion of new signals is, as we will discuss later, too expensive in the core of an optimization loop. Moreover, efficient filters are required to limit the originally huge search space.
The approach of [29] allows almost any Boolean logic optimization available from the synchronous world and assumes the use of the hazard-absorbing MHS-flops. These special-purpose flip-flops need to be designed very carefully by hand, with extensive failure-analysis tests, before they can be reliably used in practice. Moreover, their correct operation relies on the assumption that NOT gate delays are negligible with respect to AND gate delays (in contrast, most other work in the area assumes that they are smaller than AND gate delays).
Finally, Burns [30] analyzes the correctness conditions for a decomposition of a sequential element that is part of a speed-independent circuit into two sequential elements (or a sequential and a combinational element). Notably, these conditions are analyzed using the original (unexpanded) behavioral model, thus helping the efficiency of the method. Burns' work is, in our opinion, a significant step in the right direction, but it addresses mainly correctness issues. It does not describe how to use the efficient correctness checks in an automated optimization loop, and it does not allow the sharing of a decomposed gate by common-divisor extraction.
The method presented in this paper, on the other hand: 1) allows one to automatically search for a solution aimed at a given library (e.g., with specific maximum gate fanin restrictions); 2) exploits logic sharing based on multiway acknowledgment; 3) performs global optimization via resynthesis (rather than sequential decomposition).

The rest of the paper is organized as follows. Section II contains a theoretical background to facilitate subsequent understanding of the method. Section III presents an overview of the method and a simple example. Section IV provides a more detailed view of the logic decomposition and speed-independent signal insertion algorithms. Section V presents results of experiments on a set of benchmarks, obtained from various recent publications on asynchronous circuit synthesis. Section VI concludes the work.
II. THEORETICAL BACKGROUND
In this section, an overview of the synthesis flow for speed-independent circuits is presented and illustrated with a design example. Throughout the paper we assume that the reader is familiar with multilevel logic synthesis [19] , [31] .
A. Circuit Specification
Signal transition graphs (STG's) [18] , [32] are a class of interpreted Petri Nets [33] , [34] that allow the designer to comfortably capture the behavior of an asynchronous circuit in a manner that is quite similar to timing diagrams.
As an example, consider Fig. 1, which depicts the interface of a device with a VME bus. The behavior of the controller is as follows: a request to read from or write into the device is received by one of the signals, DSr or DSw, respectively. In a read cycle, a request to read is done through signal LDS. When the device has the data ready (LDTACK), the controller must open the transceiver to transfer data to the bus. In the write cycle, data are first transferred to the device. Next, a request to write is done (LDS). Once the device acknowledges the reception of the data (LDTACK), the transceiver must be closed to isolate the device from the bus. Each transaction must be completed by a return-to-zero of all interface signals, seeking maximum parallelism between the bus and the device operations. Fig. 2 shows a timing diagram of the read cycle and Fig. 3 the corresponding STG. All events in this STG are interpreted as signal transitions: rising and falling edges are labeled with "+" and "-", respectively. 2

An STG has two types of vertices: transitions (denoted by boxes) and places (circles). Places can be marked with tokens (black dots). The set of all places currently marked is called a marking. A transition is enabled if all its input places contain a token. In the initial marking of the STG in Fig. 3 only one transition, DSr+, is enabled; LDS+ is not enabled because its input place does not have a token. Every enabled transition can fire. Firing removes one token from every input place of the transition and adds one token to each of its output places. After the firing of transition DSr+, the net moves to a new marking in which LDS+ becomes enabled, while other transitions (none in this case) sharing the same input place(s) may be disabled due to the lack of input tokens.

2 We also use the notation a* if we are not specific about the direction of the signal transition.
Transitions are called concurrent if they both can fire from some marking without disabling each other.
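The firing rule and the notion of concurrency described above can be illustrated with a minimal Python sketch. It assumes a safe net (at most one token per place), so a marking is simply a set of marked places. The net `NET`, its place names, and the helper names are hypothetical and are not taken from Fig. 3.

```python
# Each transition maps to (input places, output places). Hypothetical net:
# a+ forks into b+ and c+, which join again at a-.
NET = {
    "a+": (frozenset({"p0"}), frozenset({"p1", "p2"})),
    "b+": (frozenset({"p1"}), frozenset({"p3"})),
    "c+": (frozenset({"p2"}), frozenset({"p4"})),
    "a-": (frozenset({"p3", "p4"}), frozenset({"p0"})),
}
M0 = frozenset({"p0"})  # initial marking


def enabled(net, marking):
    """Transitions whose input places all contain a token."""
    return sorted(t for t, (pre, _) in net.items() if pre <= marking)


def fire(net, marking, t):
    """Remove one token from each input place, add one to each output place."""
    pre, post = net[t]
    if not pre <= marking:
        raise ValueError(f"{t} is not enabled")
    return (marking - pre) | post


def concurrent_at(net, marking, t1, t2):
    """True if t1 and t2 can both fire from `marking` without disabling each other."""
    en = enabled(net, marking)
    if t1 == t2 or t1 not in en or t2 not in en:
        return False
    return (t2 in enabled(net, fire(net, marking, t1))
            and t1 in enabled(net, fire(net, marking, t2)))


M1 = fire(NET, M0, "a+")
```

In this hypothetical net only `a+` is enabled initially; after it fires, `b+` and `c+` are enabled and concurrent.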
B. State Graphs
Playing the token game one can generate the reachability graph (RG) with vertices corresponding to markings and arcs to transitions between markings. Fig. 4 depicts the RG for the READ cycle of the VME bus controller.
Each state of the RG can also be associated with a binary code of signal values, which label the states in Fig. 4 . Enabled signals in each state are marked with a prime 3 . An RG with binary encoding is called a state graph (SG) of an STG. State graphs are of primary importance since they form the basis of logic synthesis for asynchronous circuits [18] .
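As a rough illustration of how an RG and its binary encoding could be computed, the following Python sketch plays the token game exhaustively and then propagates signal values along the arcs. The net, the place names, and the function names are hypothetical; the sketch assumes a safe net and event labels of the form `a+` or `a-`.

```python
from collections import deque

# Hypothetical STG: a+ -> (b+ || c+) -> a- -> (b- || c-) -> a+ ...
NET = {
    "a+": (frozenset({"p7", "p8"}), frozenset({"p1", "p2"})),
    "b+": (frozenset({"p1"}), frozenset({"p3"})),
    "c+": (frozenset({"p2"}), frozenset({"p4"})),
    "a-": (frozenset({"p3", "p4"}), frozenset({"p5", "p6"})),
    "b-": (frozenset({"p5"}), frozenset({"p7"})),
    "c-": (frozenset({"p6"}), frozenset({"p8"})),
}
M0 = frozenset({"p7", "p8"})


def reachability_graph(net, m0):
    """Map every reachable marking to {transition: successor marking}."""
    graph = {}
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        if m in graph:
            continue
        graph[m] = {}
        for t, (pre, post) in net.items():
            if pre <= m:  # t is enabled in m
                successor = (m - pre) | post
                graph[m][t] = successor
                queue.append(successor)
    return graph


def binary_codes(graph, m0, initial_values):
    """Label each marking with signal values; fail if the labeling is inconsistent."""
    codes = {m0: dict(initial_values)}
    queue = deque([m0])
    while queue:
        m = queue.popleft()
        for t, successor in graph[m].items():
            signal, sign = t[:-1], t[-1]
            code = dict(codes[m])
            code[signal] = 1 if sign == "+" else 0
            if successor not in codes:
                codes[successor] = code
                queue.append(successor)
            elif codes[successor] != code:
                raise ValueError("inconsistent state encoding")
    return codes


RG = reachability_graph(NET, M0)
CODES = binary_codes(RG, M0, {"a": 0, "b": 0, "c": 0})
```

For this hypothetical net the RG has eight markings, each with a distinct binary code, so the resulting SG has no two states sharing a code.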
C. Implementability Properties
The following properties must hold in an SG to be implementable as a speed-independent circuit [9] 4 holds when a) no noninput signal transition can be disabled by another signal transition and b) no input signal transition can be disabled by a noninput signal transition. The former ensures that no short glitches, known as hazards, can appear at the gate outputs, while the latter ensures that no hazards can occur at inputs of the device.
The speed-independence property is often associated with the notion of acknowledgment. Informally, we say that transition acknowledges transition if the fact that fires after has been enabled indicates that has already fired. We say that is acknowledged if any firing sequence starting from enabled is acknowledged by some transition. 
D. Logic Synthesis
The goal of logic synthesis is to derive a gate netlist that implements the behavior defined by the specification. For the sake of simplicity, this step will be illustrated by synthesizing a speed-independent circuit for the read cycle of the VME bus (see Fig. 3 ).
The main steps in logic synthesis assume that the SG is consistent and speed independent and are as follows:
1) encode the SG in such a way that the complete state coding property holds; this may require the addition of internal signals; 2) derive the next-state functions for each output and internal signal of the circuit; 3) map the functions onto a netlist of gates.
1) Next-State Functions:
The next-state function for a signal is defined as follows. It maps the binary code of each SG state $s$ into: 1) 1 if the signal is either excited to go to 1 (its value is 0 and it is enabled in $s$) or stable at 1 in $s$; 2) 0 if the signal is either excited to go to 0 (its value is 1 and it is enabled in $s$) or stable at 0 in $s$; 3) - (don't care) for all binary codes that do not correspond to any reachable SG state. Table 1 
2) Complete State Coding (CSC):
The previous definition, however, has a problem, shown by the two underlined states in the SG of Fig. 4. They correspond to two different markings, but their binary codes are equal, 10110. Moreover, the enabling conditions in these two states for output signal LDS are different. Therefore, the value of the next-state Boolean function for signal LDS for vector 10110 should be 1 (for the first state) and 0 (for the second state). A similar problem holds for another signal. This is a conflict in the definition of the function. A possible method to solve this problem is to insert new state signals that disambiguate the encoding conflicts. Fig. 5 depicts a new SG in which a new signal, csc0, has been inserted. Now the next-state functions for LDS and for the other conflicting signal can be uniquely defined. The insertion of new signals must be done in such a way that the resulting SG satisfies consistency and speed-independence, as discussed in [20] and [21].
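The definition of the next-state function and the detection of a CSC conflict can be sketched in a few lines of Python. This is only an illustration: each SG state is given as a pair (binary code, set of indices of excited signals), and the state list `STATES` is a hypothetical fragment with signal order (x, y), not the SG of Fig. 4.

```python
def next_state_function(states, i):
    """Tabulate the next-state function of signal i and collect conflicting codes.

    A code is conflicting when two states share it but imply different
    next values for signal i.
    """
    table = {}
    conflicts = set()
    for code, excited in states:
        # An excited signal is about to toggle; a stable one keeps its value.
        value = 1 - code[i] if i in excited else code[i]
        if table.setdefault(code, value) != value:
            conflicts.add(code)
    return table, conflicts


# Hypothetical fragment: two different states carry the code (1, 0),
# but y (index 1) is excited in only one of them.
STATES = [
    ((0, 0), frozenset({0})),
    ((1, 0), frozenset({1})),
    ((1, 1), frozenset({0})),
    ((1, 0), frozenset()),
]

TABLE_X, CONFLICTS_X = next_state_function(STATES, 0)
TABLE_Y, CONFLICTS_Y = next_state_function(STATES, 1)
```

Here signal x has a well-defined function, while signal y has a conflict on code (1, 0). Codes that never appear in `table` correspond to the don't-care entries.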
3) Next-State Function Implementation: Once the next-state function has been derived, Boolean minimization can be performed to obtain a logic equation that implements the behavior of the signal. In this step it is crucial to make efficient use of the don't care conditions. For the example of Fig. 5, the following equations can be obtained:
The properties of the SG described in Section II-C ensure that any circuit implementing the next-state function of each signal with only one atomic complex gate is speed independent [9] (by atomic gate we mean a gate without internal hazardous behavior). A possible hazard-free gate implementation for the next-state function of the READ cycle example is shown in Fig. 6 , where the gate shown as a circle with "C" is a so-called C-element [9] with next state function . However, this design flow has an essential problem, because logic functions for signals might be too complex to be mapped into single gates available in the library. Hence the need for decomposition arises.
E. Gate-Level Implementability Without Hazards
In this paper, we develop a decomposition method based on the use of standard architectures. In particular, we concentrate on the standard-C architecture, which is described in Fig. 7 (a) (multiple AND-OR gates can exist for both setting and resetting the output). A synthesis method based on this architecture was first suggested in [13] and [35] . This method defines an implementation condition that is equivalent to the monotonic cover (MC) conditions [14] that we use, but only for the case of decomposition into simple gates. In our work we use the more general MC conditions because they allow one to: 1) consider a wider class of specifications (allowing both AND and OR causality) and 2) extend the basic theory to support more aggressive optimizations detailed in [36] .
In the rest of this paper we will show how to use only implementable gates, that is gates which exist in the chosen library, instead of the unbounded fanin gates assumed by the above methods.
1) Excitation and Quiescent Regions:
Given a signal $a$, we can classify the states of the SG into the following sets:
1) positive and negative excitation regions (ER's); 2) positive and negative quiescent regions (QR's). A set of states is called an ER for event $a^*$, where $a^*$ stands for $a+$ or $a-$ (denoted by $ER_j(a^*)$), if it is a maximal connected set of states in which $a^*$ is enabled. Since any event $a^*$ can have several separated ER's, an index $j$ is used to distinguish between different connected occurrences of $a^*$ in the SG. The QR (denoted by $QR_j(a^*)$) of a transition $a^*$, with excitation region $ER_j(a^*)$, is a maximal set of states $s$ reachable from $ER_j(a^*)$ such that $a$ is stable (not enabled) in $s$ and $s$ is not reachable from any other $ER_k(a^*)$ without going through $ER_j(a^*)$. Examples of ER and QR for signal LDS are shown in Fig. 5.
2) MC Conditions: Let $c_j(a^*)$ denote one of the first-level AND-OR gates in the standard-C architecture. $c_j(a^*)$ is a correct monotonic poly-term cover for the excitation region $ER_j(a^*)$ if the following three conditions are satisfied.
1) Cover condition: $c_j(a^*)$ covers all states of $ER_j(a^*)$ (i.e., it evaluates to 1 in all states of $ER_j(a^*)$).
2) One-hot condition: $c_j(a^*)$ does not cover any state outside $ER_j(a^*) \cup QR_j(a^*)$.
3) Monotonicity condition: $c_j(a^*)$ can fall at most once along any state sequence within $QR_j(a^*)$.
The meaning of the MC conditions can be understood by considering the operation of a speed-independent circuit. Suppose, for example, that at some point during circuit operation we enter a state belonging to ER . The cover condition ensures that the gate implementing function should go from 0 to 1 in that state. The second condition guarantees that no other gates in the signal network of and no gates in the signal network of can be at 1 at that moment. Therefore is the only gate in the network of signal with the output value 1 and the propagation of this value to the output of (through the OR gate and the C-element) gives a complete information on (acknowledges) the switchings in the network. When signal changes its value from 0 to 1, the circuit moves from the ER ER into the QR QR . In this region, according to MC, gate will be reset and this switching (the only one possible in QR ) will be implicitly acknowledged when will go low. Since under these conditions the outputs of the first-level AND gates are one-hot encoded, any valid Boolean decomposition of the second-level OR gates is speed independent.
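The three MC conditions can be checked directly on the SG. The Python sketch below is a simplified illustration: the cover is given by the set of states in which it evaluates to 1, and monotonicity is checked conservatively by forbidding any rising change of the cover on an arc inside the QR (once the cover has fallen, it cannot fall again without first rising). The SG, the regions, and the function name are hypothetical.

```python
def check_mc(cover_on, er, qr, arcs):
    """Return (cover_ok, one_hot_ok, monotonic_ok) for one cover.

    cover_on: states where the cover evaluates to 1
    er, qr:   excitation and quiescent regions of the covered event
    arcs:     SG arcs as (source, target) pairs
    """
    cover_ok = er <= cover_on                 # covers every state of ER
    one_hot_ok = cover_on <= (er | qr)        # covers nothing outside ER u QR
    monotonic_ok = not any(                   # no 0 -> 1 change inside QR
        s in qr and t in qr and s not in cover_on and t in cover_on
        for s, t in arcs
    )
    return cover_ok, one_hot_ok, monotonic_ok


# Hypothetical cyclic SG with six states.
ARCS = [("s0", "s1"), ("s1", "s2"), ("s2", "s3"),
        ("s3", "s4"), ("s4", "s5"), ("s5", "s0")]
ER = {"s1"}
QR = {"s2", "s3"}

GOOD = check_mc({"s1", "s2"}, ER, QR, ARCS)
NON_MONOTONIC = check_mc({"s1", "s3"}, ER, QR, ARCS)
NOT_ONE_HOT = check_mc({"s1", "s4"}, ER, QR, ARCS)
```

The cover that is 1 in `s1` and `s2` satisfies all three conditions; the one that is 1 in `s1` and `s3` rises again inside the QR, and the one that is 1 in `s4` is active outside its own regions.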
The standard-C architecture also permits a combinational implementation of a signal. If the set and reset networks are the complements of each other, then a C-element with identical inputs can be simplified to a wire [see Fig. 7 (b) and (c)] 5 .
III. DECOMPOSITION METHOD
As described in Section II, any speed-independent (i.e., output-persistent) SG satisfying the CSC condition can be implemented using the standard-C architecture. This guarantees that a correct Boolean equation can be obtained for each cover function. However, it does not guarantee that each cover can be implemented by one of the gates in the library.
To perform technology mapping, complex gates must be decomposed until all their fragments are mappable onto library gates. The problem of decomposition of combinational circuits is well known, but the methods are not directly applicable to speed-independent circuits. The decomposition of a gate into smaller gates implicitly introduces new internal signals (with delays associated with these new gates) that may cause hazards.
The approach proposed in this work splits the problem of logic decomposition of a gate into two subproblems: 1) combinational decomposition; 2) insertion of a new hazard-free signal. This process is iterated until all gates of the circuit can be mapped onto library gates or no more progress can be achieved, e.g., because no hazard-free decomposition can be found for any of the complex gates. Each subproblem is briefly described in the forthcoming sections.

5 A more precise condition for such an optimization can be formulated as follows: the set network covers all states of $ER_j(a+) \cup QR_j(a+)$ for all $j$, or similarly the reset network covers all states of $ER_j(a-) \cup QR_j(a-)$.
A. Combinational Decomposition
As is traditionally done in multilevel combinational synthesis, algebraic division has been chosen as the main operation for logic decomposition. For each cover function $c$ we look for algebraic divisors $f$, aiming at decompositions of the following type:

$$c = f \cdot g + r$$

where $g$ is the quotient of $c$ by $f$ and $r$ is the remainder, as shown in Fig. 8. In this figure, on the left is an atomic complex gate implementing the original function, while on the right is an atomic complex gate with a (simpler) function. This decomposition scheme reduces to AND-decomposition when $r = 0$ and OR-decomposition when $g = 1$. Different examples of algebraic division are shown in Table 2.
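Algebraic division itself can be illustrated with the textbook weak-division procedure from multilevel synthesis. The Python sketch below is not the implementation used in "petrify"; it represents a sum-of-products cover as a set of cubes, each cube being a frozenset of single-character literals, and all names in it are assumptions made for the example.

```python
def divide(cover, divisor):
    """Weak (algebraic) division: cover = divisor * quotient + remainder."""
    quotient = None
    for d_cube in divisor:
        # Cubes of the cover that contain d_cube, with d_cube removed.
        partial = {cube - d_cube for cube in cover if d_cube <= cube}
        quotient = partial if quotient is None else quotient & partial
    quotient = quotient or set()
    # Everything not produced by divisor * quotient stays in the remainder.
    product = {q_cube | d_cube for q_cube in quotient for d_cube in divisor}
    remainder = cover - product
    return quotient, remainder


def cubes(*terms):
    """Build a cover from strings such as 'ac' (the cube a AND c)."""
    return {frozenset(term) for term in terms}


# ac + ad + bc + bd + e divided by (a + b) gives quotient c + d, remainder e.
Q1, R1 = divide(cubes("ac", "ad", "bc", "bd", "e"), cubes("a", "b"))

# AND-decomposition: the remainder is empty.
Q2, R2 = divide(cubes("ac", "bc"), cubes("a", "b"))

# OR-decomposition: the quotient is the constant 1 (a single empty cube).
Q3, R3 = divide(cubes("ab", "c"), cubes("ab"))
```

An empty remainder corresponds to the AND-decomposition case, and a quotient consisting of the single empty cube corresponds to the OR-decomposition case.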
1) Example: Fig. 9 (a) and (b) depicts the STG and the SG of the specification of a circuit. A complex gate implementation of the circuit is shown in Fig. 10(a) .
Let us assume that only two-input gates are available in the library. Thus, signals and are not directly mappable and must be decomposed. Contrary to synchronous circuits, Fig. 10(a) by extracting the algebraic divisor . The ON-and OFF-sets of the function for are shown in Fig. 9 (e) by shadowed areas. When the circuit enters state 0000 [underlined in Fig. 9(e) ], two transitions may occur concurrently:
and . Firing first will enable gate to make a transition from low to high, while pulls the output of the gate again to low. In a speed-independent circuit, no assumptions can be made about the relative speed of concurrent transitions, and therefore the considered situation is a classical illustration of hazardous behavior on the output of gate . Hence, the decomposition is invalid.
B. Insertion of Hazard-Free Signal
Each divisor of is a candidate function to be implemented as a new signal of the circuit. The new signal will be hazard-free if all its transitions are acknowledged by other signals of the circuit. In the technique presented in this paper, transitions of may be acknowledged by several signals. This is more general and powerful than [27] and [30] where transitions of must be acknowledged locally, only by the same signal from whose cover was extracted.
Multiple acknowledgment offers two advantages:
1) the same signal can be shared by several cover functions (this corresponds to the extraction of common divisors in classical multilevel decomposition); 2) correct speed-independent decomposition can be found even if it does not exist for solutions with single acknowledgments (as shown by the experimental results).
Hazard freedom is guaranteed for the new signal as follows. Two new events, namely and , are inserted in the SG so that the properties for speed-independent implementability are preserved. The new events are defined in such a way that the implementation of signal corresponds to the selected divisor for decomposition. If and can be inserted under such conditions, is hazard-free. Now can be used as a new signal in the support of any function cover and contribute to derive simpler equations. Care must be taken not to increase the complexity of other cover functions (Section IV-C).
1) Example (Continued):
Let us consider again the example of Fig. 9 and look for a hazard-free decomposition. Among the different algebraic divisors for and , there is one that looks especially interesting for a possible sharing of logic:
. The insertion of the events and must be done according to the implementation of the signal as . The shadowed areas in Fig. 9(b) indicate the sets of states in which the Boolean function is equal to 0 and 1, respectively. must implement the transition from the states in which is equal to 0 to the states in which is equal to 1, i.e. must be a successor of , whereas must implement the opposite transition and therefore is inserted after . Fig. 9 (c) and (d) depicts two possible insertions of signal at the STG level. Both insertions result in specifications that are implementable as different speed-independent circuits (shown in Figs. 10(b) and (c) respectively) . Interestingly, both can be implemented with only 2-input gates. However, the insertion of as a predecessor of and [ Fig. 9(c) ] changes the implementation of signal , because the fact that triggers forces to be in the support of any realization of . A simpler circuit can be obtained if is made concurrent with and thus only trigger [ Fig. 9(d) ]. In the resulting circuit, signal is only in the support of and , i.e., of those signals that acknowledge the transitions of .
Therefore, the insertion of new signals for logic decomposition can be done by exploring different degrees of concurrency with regard to the behavior of the rest of the signals. Finding the best tradeoff between concurrency and logic optimization is one of the crucial problems in the decomposition of speed-independent circuits that can be explored by using our method and that makes it different from previous work (e.g., [28]).
IV. DECOMPOSITION TECHNIQUES
The generation of divisors for decomposition should be pruned to avoid an explosion of candidates for complex functions.
Two conditions help in constructing an efficient filter of solutions in the huge decomposition space. Only those decompositions are considered valid that: 1) do not introduce hazards (i.e., preserve speed independence); 2) heuristically guarantee progress in mapping the circuit to the given library.
The above conditions could be verified in a straightforward (and inefficient) way for every function used for decomposition, as was proposed in [28] . We could explicitly insert a new signal , with logic function , into the original SG and then check whether the modified SG satisfies these conditions. However, there are several reasons that make such a naive approach hardly acceptable. It was already mentioned that for complex functions the number of divisors can be huge. The situation is even worse because another dimension of complexity arises from the fact that for the same function a new signal can be inserted in many different ways [see, e.g., two different insertions for in Fig. 9 (c) and (d)]. Taking into account that the construction of a new SG is computationally expensive, there is no way to get an efficient implementation by the above straightforward approach.
Better results can be obtained if one checks both speed-independence (as proposed in [30] and discussed in Sections IV-A and IV-B) and progress conditions (as discussed in Section IV-C) directly in the original SG.
A. Property-Preserving Event Insertion
Event insertion is the operation on an SG which assigns a subset of states to be an excitation region for a new event. A new event can fire from any state of an excitation region ER . Hence in the SG that is obtained after the insertion of a new event each state of ER is split into two: before and after the firing of event (see Fig. 11 ). This operation was defined and implemented in [20] , [21] in the context of modifying an SG to satisfy the CSC property.
When a new signal is inserted into the SG the value of its logic function defines a natural bipartition over the set of SG states:
and [see Fig. 12(a) An insertion that satisfies all these conditions will be called valid. Let us consider each requirement of valid insertion separately.
1) Speed-Independence: If the insertion of preserves the speed-independence of the SG the corresponding set of states ER is called a speed-independence preserving (SIP) set. The formal conditions for the set of states ER to be a SIP set can be given in terms of intersections between ER and the so-called state diamonds of the SG [21] , which are quadruples of states obtained via the interleaving of concurrent events and . These conditions are illustrated by Fig. 13 , where all possible cases of the illegal intersections of ER with state diamonds are shown. It has been shown in [21] that any illegal intersection results in the insertion of a new signal with a hazardous behavior. For example, in the case of Fig. 13(b) it results in the invalid decomposition in Fig. 9(e) .
It is easy to see that any illegal intersection can be transformed into a legal one by adding states from the relevant diamond into ER . For example, adding state to ER in Fig. 13(b) transforms ER into a SIP set and the insertion of signal becomes hazard-free. However, such transformation may change the function for signal and is not always possible if the function, as in the case of decomposition considered in this paper, is fixed.
2) Consistency: The only consistent changing sequence for is:
Bearing in mind that in any state of IB is going to rise (i.e., these are states in which ), consistency is violated if IB is entered from a state in which (similar considerations apply to IB and ). The simplest way to avoid this problem is to expand IB by including state into it. This will make transitions from state internal to IB , that is no longer dangerous for consistency.
3) CSC: CSC is guaranteed for the newly inserted signal , since its excitation and quiescent regions are defined based on a Boolean function , that is the next-state function of . It is also easy to show that CSC is not changed for old signals, since: 1) any signal whose transitions are not delayed by is obviously unaffected; 2) any signal whose transitions are delayed by can have CSC conflicts only due to the states which have been split by the insertion of , but these states have different binary labels in signal .
4) I/O Interface: When signal is inserted into the SG using ER
, the events with which ER is exited are delayed until fires (see, e.g., event in Fig. 11 ). If such an exit event is associated with an input signal, the environment is forced to wait before it can change this input until is observed. Signal thus becomes a primary output of the circuit. This is against the idea of keeping the I/O interface intact during the decomposition. To preserve the I/O interface we can use the same remedy as for consistency violations: whenever input signals exit ER , the latter must be expanded until all the exit signals are noninputs.
B. Finding a Valid Excitation Region
An efficient procedure that finds a valid set ER [and similarly for ER ] given a function for can be organized as follows (see [21] for more details). ER is initialized to be the input border IB of . If the initial ER is consistent, preserves I/O interface and is an SIP set, then a valid insertion has been found. If one of the validity conditions does not hold, the ER is expanded toward the states with . The expansion is done in such a way that the insertion of the new signal preserves speed independence for all signals.
As an example, let us consider ER in Fig. 13(b) . The insertion of would not preserve speed independence for . There are two ways of locally expanding ER to overcome this problem: 1) by including , thus delaying and being concurrent with and 2) by including , thus being delayed by and concurrent with . In the proposed approach only forward expansions are considered and, therefore, the latter would be applied. The expansion is iteratively performed until a satisfactory solution is found. As shown in Fig. 12(b) , by expanding ER we can:
1) either obtain a set which is SIP, consistent, and preserving I/O interface (which gives the valid insertion we are looking for); 2) or reach the boundaries of the set of states with with some remaining violations.
The latter implies that there exists no valid insertion of signal with the partition implied by logic function , and the decomposition based on must be rejected. Note that the solution found in the expansion of ER (if any) produces a unique valid ER which has the minimum size, as shown in [21] . This solution, however, may not be optimal, and further expansion can be applied, e.g., to increase the amount of concurrency for , while preserving the above conditions. 1) Example (Continued): Fig. 9 shows the decomposition based on the insertion of a new signal with function . Let us explore different ways to select the excitation regions of . The set of states with is entered through state 1001, while the set with is entered through 0111. Hence IB while IB . Both input borders are SIP sets, satisfy consistency, and their exit events correspond to output signals. Therefore IB and IB give valid excitation regions for the insertion of signal , and these regions have the minimum size among all valid insertions. The corresponding STG and implementation of signal were shown in Figs. 9(c) and 10(b), respectively. The implementation of requires the acknowledgment of transitions of by gates , and . This makes the function of more complex than in the original SG.
To simplify the implementation, let us consider the expansion of ER within the set of states . The first case of expansion is shown in Fig. 14(b Fig. 10(b) . ER can be expanded further, e.g., by including state 0001 as shown in Fig. 14(c). This, however, leads to an illegal intersection with the state diamond , which violates speed independence. To solve this problem, state 0000 must also be included in ER . After that, ER illegally intersects the state diamond , which in turn can be fixed by adding 0010 to ER . The latter gives a valid selection of ER with the corresponding STG and circuit shown in Fig. 14(d) and (e). This example shows how, starting from IB , the excitation region for transition is expanded to satisfy the conditions of Section IV-A. It results in a successful decomposition because during the expansion no states in which were required to be included in ER . We have thus identified a way of finding a correct position in the state graph to insert a new signal for a given Boolean decomposition. In Section IV-C we shall look at the conditions that heuristically guarantee progress toward the overall goal of decomposing all gates that do not belong to the target library.
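The iterative forward expansion described above can be sketched as a simple fixed-point loop. This is a minimal illustration, not the procedure implemented in the tool: the function `expand_region`, the oracle `missing_states`, and the forcing table in the usage example are all hypothetical stand-ins. In particular, the oracle hides the actual analysis of state diamonds, consistency, and the I/O interface, which is not reproduced here.

```python
from typing import Callable, Dict, Optional, Set

State = str  # a state is named by its binary code, e.g. "0001"


def expand_region(
    seed: Set[State],
    boundary: Set[State],
    missing_states: Callable[[Set[State]], Set[State]],
) -> Optional[Set[State]]:
    """Grow `seed` until no repair is needed, never leaving `boundary`.

    `missing_states(region)` is an assumed oracle: it returns the states
    that must be added to repair the violations of `region`, and an empty
    set (or only states already in the region) once the region is valid.
    Returns the expanded region, or None when a required state lies
    outside `boundary`, in which case the decomposition is rejected.
    """
    region = set(seed)
    while True:
        required = missing_states(region) - region
        if not required:
            return region          # valid excitation region found
        if not required <= boundary:
            return None            # boundary reached with violations left
        region |= required


# Toy forcing table (made up): including 0001 forces 0000, which forces 0010.
FORCES: Dict[State, Set[State]] = {"0001": {"0000"}, "0000": {"0010"}}


def toy_oracle(region: Set[State]) -> Set[State]:
    needed: Set[State] = set()
    for state in region:
        needed |= FORCES.get(state, set())
    return needed


wide = expand_region({"0001"}, {"0001", "0000", "0010"}, toy_oracle)
narrow = expand_region({"0001"}, {"0001", "0000"}, toy_oracle)
```

With the wide boundary the loop stops at a region of three states; with the narrow boundary the state 0010 is required but not available, so the sketch reports failure, which corresponds to rejecting the decomposition.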
C. Progress Analysis
This section investigates a heuristic procedure that explores the huge optimization space by quasi-greedy optimization of a two-level cost function. The cost function is split into two levels in order to make sure that:
1) the newly inserted signal allows a correct speed-independent factorization of the complex gate; as we will see below, this is not always the case, even if is speed independent, since excitation region expansion may cause the factored gate to be more complex than expected;
2) other signals do not increase in cost "too much" due to the need to acknowledge the transitions of ; allowing some increase in cost is sometimes necessary in order to escape from local minima.
Both cost estimations must be performed on the original SG, without explicitly adding the new signal, in order to keep the execution time of the decomposition algorithm within reasonable limits. We will call a reduction of the former local progress and a reduction of the latter global progress, and examine each one in turn after a motivating example.
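One possible reading of the two-level cost function is sketched below. It is only an illustration under stated assumptions: the class `Candidate`, its two integer estimates, the `tolerance` parameter, and the tie-breaking rule are hypothetical and are not taken from the paper. The first level filters out candidates that give no local progress; the second level bounds and then minimizes the estimated cost increase on other signals.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    local_gain: int       # assumed estimate: literals saved on the target gate
    global_increase: int  # assumed estimate: literals added to other signals


def pick_candidate(
    candidates: Iterable[Candidate], tolerance: int = 1
) -> Optional[Candidate]:
    """Select a divisor using a two-level criterion.

    Level 1 (local progress): the target gate must actually get simpler.
    Level 2 (global progress): other signals may grow, but by no more
    than `tolerance` literals, which leaves room to escape local minima.
    """
    viable = [
        c for c in candidates
        if c.local_gain > 0 and c.global_increase <= tolerance
    ]
    if not viable:
        return None
    # Prefer the smallest global increase, then the largest local gain.
    return min(viable, key=lambda c: (c.global_increase, -c.local_gain))


pool = [
    Candidate("d1", local_gain=0, global_increase=0),  # no local progress
    Candidate("d2", local_gain=1, global_increase=3),  # too costly globally
    Candidate("d3", local_gain=1, global_increase=1),
    Candidate("d4", local_gain=2, global_increase=1),
]
best = pick_candidate(pool)
```

Both estimates are meant to be computed on the original SG, as the text requires, so that no candidate has to be inserted explicitly before it is ranked.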
1) Example (Continued):
Let the target cover function be , in which is the candidate for extraction. At first we should find valid excitation regions for the new signal . If such ER and ER can be derived, as discussed in Section IV-B, then there is a speed-independent implementation of the SG with a new signal . The purpose of the insertion of signal is to simplify the cover function by substituting with . This purely algebraic simplification is, however, not always possible in the asynchronous case, since in order to preserve speed independence, may now require more fanin signals. Let us illustrate this by considering our example. Fig. 15 shows one of the speed-independent insertions of signal based on extracting function out of the functions for signals and for the initial specification given in Fig. 9. This insertion corresponds to selecting excitation regions as follows [see Fig. 15(a)]: ER and ER . The STG with the new signal inserted, the new state graph, and the new circuit are shown in Fig. 15(b), (d), and (c), respectively.
Although function has been extracted, it does not help to reduce the literal count for the target cover of signal . It still has three literals: , because must be an input to both gates for and . To avoid confusion, note that the circuit with a two-input gate implementation for signal shown in Fig. 10(b) does not correspond to the selected ER , as can be seen by comparing the STGs from Fig. 9(c) and Fig. 15(d). In the former, precedes , while in the latter, is concurrent with . The most natural implementation (based on algebraic factoring) is unfortunately incorrect, as shown in Fig. 15(d), because function covers four states of the new state graph, while there are only two states in which signal has an implied value equal to one. Hence function does not provide a correct cover for signal .
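The argument that a cover is incorrect because it covers more states than those with implied value one can be checked mechanically. The sketch below assumes the simplest requirement for a combinational gate, namely that the cover evaluates to the implied value in every reachable state. The signal names `a`, `b`, `x` and the five states are invented for illustration and are not the states of Fig. 15.

```python
from typing import Dict, List

State = Dict[str, int]  # signal name -> 0/1
Cube = Dict[str, int]   # literal name -> required value


def evaluate(cover: List[Cube], state: State) -> int:
    """Value of a sum-of-products cover in one state."""
    return int(any(
        all(state[sig] == val for sig, val in cube.items())
        for cube in cover
    ))


def covered_states(cover: List[Cube], states: List[State]) -> List[int]:
    """Indices of the states in which the cover evaluates to one."""
    return [i for i, s in enumerate(states) if evaluate(cover, s)]


def is_correct_cover(
    cover: List[Cube], states: List[State], implied: List[int]
) -> bool:
    """True when the cover equals the implied value in every state."""
    return all(evaluate(cover, s) == v for s, v in zip(states, implied))


# Hypothetical reachable states and implied values of the target signal.
states = [
    {"a": 0, "b": 0, "x": 0},
    {"a": 1, "b": 0, "x": 1},
    {"a": 1, "b": 1, "x": 1},
    {"a": 0, "b": 1, "x": 1},
    {"a": 0, "b": 0, "x": 1},
]
implied = [0, 1, 1, 0, 0]

naive = [{"x": 1}]            # single literal suggested by algebraic factoring
fixed = [{"a": 1, "x": 1}]    # extra fanin restores correctness
```

In this toy case the single-literal cover is one in four states although only two states have implied value one, so it is rejected; adding a fanin signal gives a correct but larger cover, which mirrors the loss of the expected literal saving discussed above.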
2) Local Progress Conditions:
The local progress condition has the purpose of verifying that the algebraic decomposition is indeed a valid speed-independent decomposition. Thus we must check that satisfies in the new SG all three MC conditions defined in Section II-C. This is true, informally, if and only if the following apply. We will now formulate the local progress condition by presenting conditions for preserving monotonic cover conditions for substituting function with one literal in the cover function . If these conditions, formulated in terms of the original state graph, are satisfied, then simplification for gate implementing signal is guaranteed to be possible.
For a given event and a set of states we define the set of states following immediately after as follows:
2) One-hot condition: for all : if ER QR , then ER ;
3) Monotonicity conditions:
3) Global Progress Conditions:
The local progress conditions (if satisfied) guarantee that the implementation of cover function will be simplified as a result of decomposition. However, to accept a decomposition we need to ensure that it does not significantly increase the complexity of logic for other signals. trigger signals after the insertion will never be more complex than the original one.
The reader is referred to [37] for more details and more sophisticated methods for global progress estimation.
V. EXPERIMENTAL RESULTS
The strategy for algebraic decomposition presented above has been implemented in the CAD tool "petrify." Algebraic decomposition has been applied to a set of benchmarks, and the results are shown in Table 3 .
We measured the complexity of each gate as the number of literals required to implement it as a sum-of-product gate, ignoring the cost of input and output inversions. Thus both a two-input EXOR gate and a gate implementing function are considered to be four-literal gates.
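The literal-count metric can be made concrete with a short sketch. The representation of a sum-of-products form as a list of cubes and the function name `literal_count` are assumptions made for illustration; the second example function, ab + cd, is likewise only an assumed instance of a four-literal gate.

```python
from typing import Dict, List

Cube = Dict[str, int]  # variable name -> polarity (1 = plain, 0 = inverted)


def literal_count(sop: List[Cube]) -> int:
    """Number of literals in a sum-of-products form.

    Polarity is stored but not charged, so input inversions cost nothing,
    in line with the metric described in the text.
    """
    return sum(len(cube) for cube in sop)


# Two-input EXOR: a b' + a' b
exor2 = [{"a": 1, "b": 0}, {"a": 0, "b": 1}]

# Assumed example of another four-literal gate: a b + c d
and_or = [{"a": 1, "b": 1}, {"c": 1, "d": 1}]

exor_cost = literal_count(exor2)
and_or_cost = literal_count(and_or)
```

Under this metric both gates cost four literals, even though the EXOR gate has only two inputs.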
The improvements obtained by allowing global acknowledgment of signals are illustrated in Fig. 16. For the STG of Fig. 16(a), output signals and are implemented by three-input AND gates. Global acknowledgment makes it possible to find a decomposition into two-input AND gates, in which both outputs and are used to acknowledge the transitions of a new signal . No valid decomposition preserving speed independence exists when is acknowledged by only one output (either or ). Methods that consider only local acknowledgment (i.e., within a single signal network), such as [6], would fail to find such decompositions.
The first set of columns in Table 3 indicates the complexity of the circuit before decomposition. The second set of columns reports the number of signals inserted for decomposition using gates with at most literals , and the CPU time required to find the decomposition (in seconds, for a SparcStation 20). The number of inserted signals also gives the number of iterations of the decomposition algorithm (the circuit is resynthesized every time a new signal is inserted). The next column summarizes the results presented by Siegel [27] about the implementability of the circuit with only two-input gates (see footnote 7). All decompositions have been independently verified to be speed independent.
Only five out of the 32 examples were not implementable by our method (which, like all other known methods in the literature, is only heuristic) with two-input gates (entry "n.i." in the table). Only one five-input gate in "pe-send-ifc" and two five-input gates in "tsend-bm" could not be decomposed when attempting to implement these circuits with four-input gates. We significantly improve over the results presented in [27], and only one circuit (pe-rcv-if) from that benchmark suite could not be implemented with two-input gates.
Global acknowledgment allows the method to effectively decompose complex gates with high fan-in (six or seven literals). This is shown by circuits like mr1 and vbe10b, which are implemented with two-input gates. Fig. 17 illustrates this fact, depicting the circuit mr1 before and after logic decomposition into two-input gates.
The final columns present a rough estimation of the cost of speed-independence-preserving logic decomposition. The cost is evaluated by comparing the area (after technology mapping) in the case of logic decomposition that preserves (SI) and does not preserve (non-SI) speed independence, respectively. The former is performed by decomposing the circuits into three-literal gates and then mapping onto a gate library. During mapping, small gates can be collapsed to match larger gates in the library without introducing hazardous behavior. The non-SI mapping is performed by SIS by using the following script: astg_to_f; source script.rugged; map; phase. In some cases, such as vbe6a, the area of the SI implementation is smaller than the non-SI one, because even non-SI decomposition is just a heuristic technique, and hence our algorithm happens to find a better solution. In most cases, on the other hand, a relatively low area cost (generally less than 15%) is required in order to preserve speed independence.
Footnote 7: A more direct comparison with more recent work [6] is difficult because the complexity of implementation is measured in [6] in terms of inputs to FPGA lookup tables, but not in terms of simple gates.
VI. CONCLUSION
This paper has presented a solution to the problem of logic decomposition of asynchronous speed-independent circuits. The method, implemented in the tool "petrify," is based on a two-step approach. The first step chooses a candidate for decomposition from the set of algebraic divisors of the target function. The second step performs the actual decomposition by implementing the candidate function as a new signal that can be used to resynthesize the whole circuit. Multiple acknowledgments for appear automatically at this function generation step and help to guarantee the hazard-freedom of the decomposed function.

He is an Associate Professor in the Department of Software, Universitat Politècnica de Catalunya. In 1988, he was a Visiting Scholar at the University of California, Berkeley. His research interests include theory of concurrent systems applied to computer-aided design, with special emphasis on synthesis and formal verification of asynchronous systems and hardware-software codesign. He is also doing research on computer arithmetic and parallel architectures. He has coauthored over 80 research papers in technical journals and conferences.
Dr. Cortadella served on the technical committees of several international conferences in the field of design automation and concurrent systems. 
Michael Kishinevsky

