A novel synthesis method for dual-rail asynchronous multi-level logic is proposed. The logic is implemented as a monotonic multi-level network of minimized AND-OR nodes together with the completion detection logic. Each node is a hazard-free structure. This is achieved by means of the product term minimization constraint that the authors formulated and proved in their previous paper. The MCNC and ISCAS benchmark sets were processed and the area overhead with respect to the synchronous implementation was evaluated. Then the implementation complexities of the proposed method and of a state-of-the-art method based on the duplication of every gate were compared. A considerable improvement was obtained.
INTRODUCTION
Asynchronous logic is classified according to its mode of interaction with the environment. In the input-output mode, the environment is allowed to change the input state once the new output state is produced. No assumption is made about the internal signals, so the environment may change the input state before the circuit has stabilized in response to the previous input state. In the fundamental mode, the logic operates under the following discipline: the environment changes the input state only once the output state has changed in response to the current input state and each gate inside the circuit is stable. Both design methodologies assume either bounded (a maximal value is known) or unbounded (a maximal value is unknown) gate and wire delays.
In the fundamental mode (adopted in this paper) with bounded delays, the moment when the environment may change the input state is estimated from the worst-case propagation delay [Unger, 1969]. Within this model, only one input signal can change at a time. In [Nowick, 1993], the generalized fundamental mode was proposed, where multiple input changes are allowed during a narrow time interval. For this mode, a method of hazard-free two-level implementation was proposed [Nowick, 1995]. Multi-level (hazard-non-increasing) transformations are applied to optimize the implementation [Unger, 1969 and Kung, 1992]. Methods of hazard-free technology mapping were proposed in [Beerel, 1996 and Siegel, 1993].
With unbounded delays, the circuit must be able to recognize the moment when the input and output states have changed. For this purpose, both inputs and outputs are implemented using a dual-rail encoding. To change an input state, the environment must first reset it (change it to the so-called space state). As a result, the output state resets too. After that, the environment sets a new input state, which implies a new output state. Multi-level implementations of dual-rail asynchronous logic were proposed in [Cortadella, 2004 and Ligthart, 2000]. These methods are based on an initial decomposition of the circuit into simple two-input gates (OR, AND, NOR, NAND, etc.). Each gate is then mapped into DIMS [Sparsø, 1992] or into a so-called threshold gate [Ligthart, 2000]. As a result, the total circuit complexity is very high. In [Cortadella, 2004], each simple gate is doubled to ensure monotonicity and, as a result, a hazard-free implementation. In [Lemberski, 2009], a two-level (NOR-NOR, NAND-NAND) dual-rail asynchronous logic suitable for mapping onto the conventional two-level structure was presented. Using this result, we propose a method based on an initial decomposition of the logic function into a single-rail Boolean network in which each node is represented as a two-level logic; the network is then transformed into a dual-rail one to ensure monotonicity and a hazard-free implementation. Although our approach slightly increases the complexity of the functional logic, the complexity of the completion detection logic is reduced significantly, since fewer nodes must be supplied with completion detection than in [Cortadella, 2004]. As a result, a considerable improvement in total complexity is obtained.
PRELIMINARIES

Input/Output Dual-Rail Encoding
Let $F = \{f_1, f_2, \dots, f_q\}$ be an asynchronous multi-output function of $n$ inputs $X = \{x_1, x_2, \dots, x_n\}$, represented as a multi-level Boolean network (Fig. 1). We call this a single-rail multi-level representation.
Generally, an asynchronous logic should be able: 1) to recognize the moment when a new input state (generated by the environment) appears on the inputs and the moment when the circuit generates a new output state in response to it; 2) to notify the environment of the new input and output states. After receiving the notification, the environment can generate the next input state. To solve this problem, the inputs and outputs are implemented using a dual-rail encoding. To change the input state, the environment must first reset it to the space state and then set it to the proper working state. In the reset phase, the output state changes from the working state to the space state, and in the set phase the new output state is recognized.
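The reset/set discipline described above can be illustrated with a minimal Python sketch. The rail assignment used here, a pair (rail1, rail0) with (1, 0) for logical 1, (0, 1) for logical 0 and (0, 0) for the space state, is an assumption made for illustration, and the names `encode`, `is_working`, `is_space` and `handshake` are hypothetical.

```python
# Assumed convention (not fixed by the text): a signal is a pair (rail1, rail0).
SPACE = (0, 0)

def encode(bit):
    """Return the dual-rail working state for a Boolean value."""
    return (1, 0) if bit else (0, 1)

def is_working(pair):
    """A signal is in a working state when exactly one rail is high."""
    return pair in ((1, 0), (0, 1))

def is_space(pair):
    """A signal is in the space state when both rails are low."""
    return pair == SPACE

def handshake(words):
    """Yield the input states applied by the environment.

    Every new working state is preceded by a reset to the space state.
    """
    for word in words:
        yield [SPACE] * len(word)         # reset phase
        yield [encode(b) for b in word]   # set phase
```

For a single two-bit input word [1, 0], `handshake` yields the all-space state first and the encoded working state second.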
As a result of the decomposition, each function $y_c$ is represented as a pair $y_c = (y_c^{(1)}, y_c^{(0)})$, where $y_c^{(1)}$ and $y_c^{(0)}$ describe the ON- and OFF-sets ($y_c^{(0)}$ can be generated as the complement of the ON-set). Each node (both ON- and OFF-sets) can then be minimized to reduce the implementation cost. In [Lemberski, 2009], we formulated a minimization constraint that the two-level logic should satisfy to ensure a hazard-free implementation. Namely, each function $y_c$ should be represented as a pair of minimized Sum-of-Products (SOP) forms $y_c = (Y_c^{(1)}, Y_c^{(0)})$, where the product terms within each form are mutually orthogonal: $t_i \wedge t_j = 0$ for all $i \neq j$ with $t_i, t_j \in Y_c^{(1)}$, and likewise for $t_i, t_j \in Y_c^{(0)}$. A sum of orthogonal products is called a Disjoint-Sum-Of-Products (DSOP). In [Cortadella, 2004], conditions are formulated under which a Boolean network can be implemented as hazard-free logic. The conditions are based on the monotonicity and hazard-free implementation of each node.
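The orthogonality requirement can be checked mechanically. The sketch below assumes that product terms are written as cube strings over '0', '1' and '-' (one character per single-rail variable, as in PLA-style notation); this representation and the function names are illustrative choices, not part of the original method description.

```python
from itertools import combinations

def orthogonal(a, b):
    """Two cubes are orthogonal if some variable is 1 in one and 0 in the other."""
    return any({x, y} == {"0", "1"} for x, y in zip(a, b))

def is_dsop(cubes):
    """A cover is a DSOP if all of its product terms are pairwise orthogonal."""
    return all(orthogonal(a, b) for a, b in combinations(cubes, 2))
```

For example, for the two-input OR function, the cover ["1-", "-1"] is a valid SOP but not a DSOP (both terms cover the input 11), whereas ["1-", "01"] covers the same function with mutually orthogonal terms.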
STRUCTURE OF MULTI-LEVEL IMPLEMENTATION USING COMPLEX NODES
Monotonicity and Hazard-Free
Our structure is based on the concept of node monotonicity introduced in [Cortadella, 2004] and on the condition for hazard-free implementation of each two-level (AND-OR) node proposed in [Lemberski, 2009]. Node monotonicity is easily achieved by the dual-rail encoding.
Hazard-free implementation. The Boolean network is hazard-free if each node is hazard-free.
The hazard-free implementation of the two-level positive dual-rail structures is based on the formulated minimization condition: the product terms implementing the two-level AND-OR logic should be mutually orthogonal [Lemberski, 2009].
Note that in [Cortadella, 2004], a Boolean network with simple nodes (AND, OR, NAND gates, etc.) was considered. For such a network, node monotonicity is the only requirement needed to guarantee a hazard-free implementation of each node and, as a result, of the whole network. However, this is not the case for a network with complex AND-OR nodes, where an additional condition (to ensure the hazard-free implementation of each node) must be formulated.
Basic Structure for Node Implementation
The implementation was proposed in [Lemberski, 2009] for multi-output logic (in our case, it is reduced to a single-output one). It consists of two blocks (Fig. 2): a two-level AND-OR logic and the completion detection logic. Each AND gate implements a product term obtained after the minimization (recall that only a minimization that produces mutually orthogonal terms is allowed). Each product term is described by the set S(
Multi-Level Network
Consider an arbitrary multi-level Boolean network (Fig. 1). The network is transformed into a dual-rail one based on the rules described in Section 2. Then, each pair of nodes representing a function in both its positive and negative form is mapped into the structure depicted in Fig. 2. The multi-level structure consists of two blocks (Fig. 3): the functional block, implemented as a multi-level logic of two-level AND-OR single-output nodes with a fan-in limited to 2k (recall that each input of a single-rail node is represented by two signals in dual-rail), and the completion detection logic, obtained by merging the completion detection logic of all two-level nodes. The completion detection should indicate the proper state (working or space) not only of the network primary inputs and outputs but of the node outputs as well. The logic is based on an (n+m)-input C-element together with (n+m) two-input OR gates, where n is the number of primary inputs and m is the number of nodes (including those generating the q primary outputs). The completion detection signal D (Fig. 3) goes up when the primary inputs and node outputs are all in a working state and goes down when these signals are all in the space state.
Fig. 3. Dual-rail multi-level network
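The behaviour of the completion detection signal D can be sketched with a small behavioural model in Python. It assumes a single C-element fed by one two-input OR gate per dual-rail signal and ignores all delays; the class and function names are hypothetical and the model is not a gate-level description of the circuit in Fig. 3.

```python
class CElement:
    """Behavioural C-element: output follows the inputs only when they all agree."""

    def __init__(self, state=0):
        self.state = state

    def update(self, inputs):
        if all(inputs):
            self.state = 1          # all inputs high: output goes up
        elif not any(inputs):
            self.state = 0          # all inputs low: output goes down
        return self.state           # otherwise the previous value is held

def completion_signal(c_element, signals):
    """Compute D from dual-rail pairs of primary inputs and node outputs.

    Each pair (rail1, rail0) is ORed; the OR outputs drive the C-element.
    """
    return c_element.update([r1 | r0 for r1, r0 in signals])
```

D rises only after every monitored signal has reached a working state and falls only after every one has returned to the space state; in between, the C-element holds its previous value.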
SYNTHESIS PROCEDURE
The synthesis of the multi-level dual-rail logic with AND-OR nodes is based on the tools ABC [Berkeley], Espresso [Brayton, 1984] and DSOP [Bernasconi, 2008]. First, an ABC script is applied to the initial circuit representation to obtain a multi-level single-rail Boolean network with the fan-in of each node limited to k. For this, we employ a LUT mapping synthesis process, since each LUT is actually represented as a single-output AND-OR node with a limited number of inputs (in the ABC output format).
We used the sequence of ABC commands recommended for LUT synthesis in the ABC reference guide. This command sequence was repeated four times to obtain better results.

strash
balance
fpga -K k

Fig. 4. The LUT decomposition script. Substitute k for the maximum node fan-in

Then, the network is transformed into the dual-rail representation by computing the complement of each node (using the "sharp" operator [Brayton, 1984]). As a result, the number of nodes is doubled, while the node functions may now depend on up to 2k inputs (since the positive and negative signals are represented as separate rails). Next, minimization is performed (using Espresso) for the OFF-set nodes to obtain the minimized function $y_c = (Y_c^{(1)}, Y_c^{(0)})$. Finally, we run DSOP [Bernasconi, 2008] for all the nodes to obtain mutually orthogonal terms.
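To make the dual-rail transformation of a single node concrete, the following Python sketch splits a small k-input node into its ON- and OFF-sets and rewrites each minterm as a product of positive dual-rail literals. It uses brute-force truth-table enumeration in place of the sharp operator, Espresso and DSOP, so the resulting terms are unminimized minterms (which are trivially orthogonal); the function names and the `name_bit` literal naming are assumptions for illustration only.

```python
from itertools import product

def on_off_sets(func, k):
    """Enumerate all minterms of a k-input node and split them by output value."""
    on, off = [], []
    for bits in product((0, 1), repeat=k):
        (on if func(*bits) else off).append(bits)
    return on, off

def to_dual_rail(minterm, names):
    """Map a single-rail minterm to a product of positive dual-rail literals.

    A variable equal to 1 selects its '1' rail; a variable equal to 0 selects
    its '0' rail, so no inverted literal is needed.
    """
    return [f"{name}_{bit}" for name, bit in zip(names, minterm)]
```

For a two-input AND node with inputs a and b, the ON-set contains the single minterm (1, 1), which maps to the product of rails a_1 and b_1, while the three OFF-set minterms form the complementary node.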
EXPERIMENTAL RESULTS
Experimental background
We have processed the MCNC [Yang, 1991] and ISCAS [Brglez, 1985; Brglez, 1989] sets of benchmarks, 228 circuits altogether. We evaluate the complexity (expressed as the number of gate equivalents (GEs) [De Micheli, 1994]) of the proposed asynchronous implementation of these circuits.
For the proposed structure, we estimate the complexity of the functional network and of the completion detection logic separately. Then, the total complexity is calculated. To avoid additional inverters and thus decrease the implementation complexity, we use negative (NAND-NAND) gates instead of AND-OR ones in the functional block and NOR gates instead of OR ones in the completion detection logic. As a result, the signal D = 1 (D = 0) when all inputs and outputs are in the space (working) state. Duplicated terms are implemented only once. We assume technology-independent synthesis (the fan-ins of the negative gates and of the C-element are not limited). The gate complexity is estimated as follows: an n-input NAND or NOR gate requires 0.5n GEs [Sparsø, 2001]. To implement an (n+m)-input C-element, (n+m+1) GEs are required. To implement n+m two-input NOR gates, 0.5(n+m) GEs are required. In total, (1.5n+1.5m+1) GEs are required to implement the completion detection logic for an n-input multi-level logic with m nodes.
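The cost model above can be written as a few small Python helpers. They take the per-component figures exactly as stated in the text (0.5 GEs per gate input, fan-in + 1 GEs for a C-element, and 0.5(n+m) GEs for the NOR gates of the completion detection logic); the function names are hypothetical.

```python
def gate_ge(fan_in):
    """GE estimate for a NAND or NOR gate with the given fan-in (0.5 per input)."""
    return 0.5 * fan_in

def c_element_ge(fan_in):
    """GE estimate for a C-element with the given fan-in (fan_in + 1)."""
    return fan_in + 1

def completion_detection_ge(n, m):
    """GE estimate of the completion detection logic for n inputs and m nodes.

    Uses the figures stated in the text: an (n+m)-input C-element plus
    0.5 * (n + m) GEs for the NOR gates, i.e. 1.5n + 1.5m + 1 in total.
    """
    return c_element_ge(n + m) + 0.5 * (n + m)
```

As a purely illustrative calculation (the node count is not taken from the benchmark tables), a network with n = 9 primary inputs and m = 5 nodes would need 15 + 7 = 22 GEs for completion detection under this model.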
Note that the complexity of the sequential logic memory (flip-flops, latches) is not included in the results.
Selection of k
The first issue addressed in the experiments is the proper selection of k (the maximum node fan-in). For large k, problems often arise when computing the complements of the nodes, because of the exponential complexity of the operation. More importantly, nodes with a high fan-in are difficult to implement in technology. On the other hand, small values of k induce more nodes, which makes the completion detection logic more complex.
A similar problem has been encountered in the design of FPGA fabrics, when deciding on the optimum look-up table (LUT) size [Gao, 2005]. It has been found that implementing the design using 4- or 5-input LUTs brings the most benefit. We have reproduced this observation in numerous experiments. An example is shown in Fig. 5 for the 9sym MCNC benchmark circuit. We synthesized this circuit using the LUT decomposition script (see Fig. 4) for k varying from 2 to 20 and measured the total complexity of the asynchronous logic (i.e., the functional logic together with the completion detection). A deep global minimum can be observed for k = 4. Very similar results are obtained for the vast majority of the other benchmark circuits, for both decomposition scripts. For this reason, all the following experiments are performed for k = 4.
Standard Benchmarks Results
Results obtained for selected MCNC [Yang, 1991] and ISCAS [Brglez, 1985; Brglez, 1989] benchmark circuits are presented in the summary Table 1. We have evaluated the area overhead of our proposed asynchronous logic design method w.r.t. a conventional synchronous design. Then, we have compared our method with a state-of-the-art asynchronous logic design method proposed in [Cortadella, 2004]. In all cases, 4-input AND-OR nodes are considered.
In Table 1 , first, the benchmark name and numbers of its primary inputs and outputs (n, q) are given. Synthesis results obtained by decomposing the original circuit into a network of 4-input AND-OR nodes are shown in the following triplet of columns "Synchronous". The first column indicates the number of network levels (critical path), the number of decomposed circuit nodes follows, the last column shows the complexity of the circuit's synchronous implementation, in terms of GEs.
The complexity of the proposed asynchronous multi-level implementation of the circuits is shown next. Complexities of the functional logic ("Funct. GEs") and the completion detection logic ("CD GEs") are shown first, then the values are summed together to obtain the final asynchronous logic complexity ("Total GEs"). The area increase of the asynchronous logic w.r.t. the synchronous implementation is shown in the next column ("Over.").
Complexities of the asynchronous multi-level implementation proposed in [Cortadella, 2004] are shown in the next triplet of columns. Again, the functional, completion detection and total complexities are given. The area reduction obtained by our method, w.r.t. [Cortadella, 2004] , is shown in the last table column ("Impr").
Summary of the Experiments
We have processed 228 benchmark circuits altogether. The area overhead of the asynchronous implementation compared to the synchronous implementation is 64% on average. Compared to the state-of-the-art approach, we have obtained an average improvement of 17%. For some circuits, however, the improvement reaches up to 40%.
CONCLUSION
A novel synthesis method for dual-rail asynchronous multi-level logic is proposed. The logic is implemented as a monotonic multi-level network of minimized AND-OR nodes together with the completion detection logic. Each node is a hazard-free structure. This is achieved by means of the product term minimization constraint (product terms must be mutually orthogonal) that the authors formulated and proved in [Lemberski, 2009]. The MCNC and ISCAS benchmarks were processed and the complexities of the synchronous and asynchronous implementations were compared. For the asynchronous logic, the area overhead is 64% on average. In comparison with the state-of-the-art approach, we reached a 17% area improvement on average.
