In this paper we study structural gate decomposition in general, simple gate networks for depth-optimal technology mapping using K-input Lookup-Tables (K-LUTs). We show that (1) structural gate decomposition in any K-bounded network results in an optimal mapping depth smaller than or equal to that of the original network, regardless of the decomposition method used; and (2) the problem of structural gate decomposition for depth-optimal technology mapping is NP-hard for K-unbounded networks when $K \geq 3$ and remains NP-hard for K-bounded networks when $K \geq 5$. Based on these results, we propose two new structural gate decomposition algorithms, named DOGMA and DOGMA-m, which combine the level-driven node-packing technique (used in Chortle-d) and the network flow-based labeling technique (used in FlowMap) for depth-optimal technology mapping. Experimental results show that (1) among five structural gate decomposition algorithms, DOGMA-m results in the best mapping solutions; and (2) compared with speed_up (an algebraic algorithm) and TOS (a Boolean approach), DOGMA-m completes decomposition of all tested benchmarks in a short time while speed_up and TOS fail in several cases. However, speed_up results in the smallest depth and area in the following technology mapping steps.
INTRODUCTION
Field programmable gate arrays (FPGAs) have been widely used in circuit design implementation and system prototyping due to their short design cycles and low nonrecurring engineering costs. An important class of FPGAs use lookup-tables (LUTs) as the basic logic element. A K-input LUT (K-LUT), which consists of $2^K$ SRAM cells, can store the truth table of an arbitrary Boolean function of up to K variables. By connecting LUTs into a network, LUT-based FPGAs can be used to implement circuit designs in a short time.
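As an illustration of the statement that a K-LUT stores the truth table of any function of up to K variables in $2^K$ cells, the following minimal Python sketch models a LUT as a list of bits. The class name `LUT` and its interface are hypothetical and serve only as an illustration; they do not describe any particular FPGA architecture.

```python
from itertools import product


class LUT:
    """Hypothetical model of a K-input lookup table with 2**K one-bit cells."""

    def __init__(self, k, func):
        self.k = k
        # One cell per input combination, in binary counting order
        # (first input is the most significant bit of the cell index).
        self.cells = [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

    def __call__(self, *bits):
        assert len(bits) == self.k
        index = 0
        for b in bits:
            index = (index << 1) | (1 if b else 0)
        return self.cells[index]


# Two different 3-input functions fit in the same kind of 3-LUT.
xor3 = LUT(3, lambda a, b, c: a ^ b ^ c)
maj3 = LUT(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
```

Both functions occupy exactly eight cells, which is why the LUT count and the LUT depth, rather than the gate type, determine area and delay in the rest of the paper.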
Logic synthesis for LUT-based FPGAs transforms networks of logic gates into functionally equivalent LUT networks. The process is usually divided into two tasks: logic optimization and technology mapping. Logic optimization extracts common subfunctions to reduce the circuit size and/or resynthesizes critical paths to reduce the circuit delay. Technology mapping consists of two subtasks: gate decomposition and LUT mapping. In gate decomposition, large gates are decomposed into gates of at most K inputs (that is, K-bounded). The resulting K-bounded network is then mapped onto (i.e., covered by) K-LUTs in the LUT mapping step. The separation of optimization and mapping tasks is artificial. Some LUT synthesis algorithms (e.g., Lai et al. [1994] and Wurth et al. [1995] ) decompose collapsed networks into LUT networks directly. The objectives of these tasks include area minimization, delay minimization, routability maximization, or a combination of all of them. A comprehensive survey of gate decomposition, LUT mapping, and logic synthesis algorithms for LUT-based FPGAs can be found in .
The delay of an LUT network can be measured by the number of levels (or depth) in the network under the unit delay model. A number of algorithms were proposed in the past for delay-oriented LUT mapping. We classify them into two classes. The first class of algorithms, such as Chortle-d [Francis et al. 1991b] ; DAG-Map [Chen et al. 1992] ; and FlowMap [Cong and Ding 1994a] perform LUT mapping without logic resynthesis. Among these algorithms, Chortle-d guarantees depth-optimal technology mapping for simple gate tree networks, and FlowMap guarantees depth-optimal LUT mapping for general K-bounded networks. Following FlowMap, FlowMap-r [Cong and Ding 1994b] and CutMap further reduce the mapping area, and FlowMap-d [Cong and Ding 1994c] and Edge-Map [Yang and Wong 1994] minimize delay under a more accurate net delay model. Another class of LUT mapping algorithms, such as MIS-pga-delay [Murgai et al. 1991] ; TechMap-D [Sawkar and Thomas 1993] ; FlowSyn [Cong and Ding 1993] ; and ALTO [Huang et al. 1996 ] collapse critical paths followed by delay-oriented logic resynthesis. Due to resynthesis, this class of algorithms could obtain mapping depth smaller than the optimal depth computed by FlowMap, but usually with longer computation time.
Gate decomposition may significantly affect the network depth obtained by the algorithms in the first LUT mapping class. For example, the network in Figure 1(a) is not a K-bounded network for $K = 3$. When node v is decomposed as shown in Figure 1(b), any mapping algorithm will result in a depth of 3 or larger. But if node v is decomposed in the way shown in Figure 1(c), a mapping solution with a depth of 2 can be obtained. In addition, when a K-bounded network is further decomposed, the mapping depth could be reduced. Figure 2(a) shows a 3-bounded network. For $K = 3$, FlowMap produces a 3-level mapping solution of 5 LUTs. (Every shaded square represents an LUT in the figure.) But if node v is further decomposed, FlowMap produces a 2-level network of 4 LUTs (Figure 2(b)). The two examples demonstrate that gate decomposition affects the depth obtained by LUT mapping algorithms.
We classify gate decomposition methods into structural, algebraic, or Boolean approaches. Structural gate decomposition can only be applied to simple gates (e.g., AND gates, OR gates, XOR gates). Complex gates need to be transformed into simple gates (e.g., via AND-OR decomposition) before any structural decomposition. The tech_decomp algorithm in SIS [Sentovich et al. 1992] ; the dmig algorithm [Wang 1989; Chen et al. 1992] ; and the Chortle family of mapping algorithms [Francis et al. 1991a; 1991b] perform structural gate decomposition. In algebraic gate decomposition approaches, networks are usually partially collapsed and gates are represented in the sum-of-product (SOP) form. Common logic subfunctions are then extracted with algebraic divisions [Rudell 1989; De Micheli 1994] . The speed_up algorithm in SIS [Sentovich et al. 1991] is an algebraic approach which collapses critical paths followed by network resynthesis for delay minimization. In Boolean gate decomposition approaches, logic gates are decomposed via functional operations. Shannon expansion, if-then-else (ITE) decomposition, and AND-OR decomposition are very common Boolean gate decomposition operations. Recently, functional decomposition techniques [Ashenhurst 1959; Curtis 1961; Roth and Karp 1962] were used in a number of LUT network synthesis algorithms [Lai et al. 1994; Wurth et al. 1995; Legl et al. 1996b] . In these algorithms, networks are completely collapsed whenever possible so that the outputs can be represented as functions of the network inputs directly. The output functions are then decomposed into composed K-input subfunctions for implementation using K-LUTs. Optional LUT mapping steps may follow to improve the synthesis results. The FGSyn algorithm [Lai et al. 1994 ] and the BoolMap-D algorithm [Legl et al. 1996b ] take this approach for delay-oriented LUT network synthesis. 
Generally speaking, algebraic approaches and Boolean approaches are more effective for both area and delay minimization in technology mapping, while structural approaches are usually faster. Hybrid approaches such as algebraic decompositions followed by structural decompositions are used in many logic synthesis approaches. In this paper we study structural gate decomposition for delay minimization in general networks with the following motivations. First, we have shown that the way gates are decomposed can affect the mapping depth computed by FlowMap. A good gate decomposition step allows mapping algorithms to obtain the smallest mapping depth. Second, structural gate decomposition allows arbitrary grouping of gate inputs for our optimization objective, while algebraic or Boolean approaches do not have this advantage. Third, structural gate decomposition is computationally efficient. This is an important factor for mapping large designs and estimating the mapping delay or area. IC process technology has advanced to 0.18 μm and below, and million-gate FPGAs have become a reality. Structural gate decomposition algorithms can be employed in technology mapping approaches to keep pace with this trend.
Several delay-oriented structural gate decomposition algorithms were proposed in the past. The tech_decomp algorithm [Sentovich et al. 1992] decomposes each simple gate into a balanced fanin tree to minimize the number of levels locally. The dmig algorithm [Wang 1989; Chen et al. 1992] is based on the Huffman coding algorithm and guarantees the minimum depth in the decomposed network. However, the mapping depth might not be the minimum. The network in Figure 1 (b) is actually decomposed using dmig and results in a suboptimal mapping depth. The Chortle-d algorithm [Francis et al. 1991b ] employs bin-packing heuristics to achieve depth minimization, but is optimal for trees only. In this paper we go one step further. We shall develop structural gate decomposition algorithms for depth-optimal technology mapping on general networks.
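The difference between a balanced fanin tree and a Huffman-style decomposition can be illustrated with a short Python sketch. It computes only the depth of the decomposed 2-input gate tree from the depths of the fanins. The function names are hypothetical, and the code is written in the spirit of the two strategies described above rather than being the tech_decomp or dmig implementation. As noted above, a minimum depth in the decomposed network does not by itself guarantee a minimum mapping depth.

```python
import heapq


def huffman_depth(fanin_depths):
    """Depth of the root when the two shallowest signals are always combined first."""
    heap = list(fanin_depths)
    heapq.heapify(heap)
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        # A new 2-input gate sits one level above its deeper fanin.
        heapq.heappush(heap, max(a, b) + 1)
    return heap[0]


def balanced_depth(fanin_depths):
    """Depth of the root of a balanced tree that pairs fanins in the given order."""
    level = list(fanin_depths)
    while len(level) > 1:
        nxt = [max(level[i], level[i + 1]) + 1 for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # odd signal is carried to the next round
        level = nxt
    return level[0]


# One late-arriving fanin (depth 3) and four primary inputs (depth 0).
depths = [3, 0, 0, 0, 0]
late_aware = huffman_depth(depths)
order_blind = balanced_depth(depths)
```

In this example the Huffman-style grouping places the late fanin next to the root, whereas the order-blind balanced tree buries it at the bottom.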
The rest of this paper is organized as follows. Section 2 defines the terminology, presents general properties, and formulates the structural gate decomposition problems. Section 3 addresses the NP-completeness of the problems. Section 4 presents two new algorithms, DOGMA and DOGMA-m, for structural gate decomposition. Experimental results are presented in Section 5, and Section 6 concludes the paper. A preliminary version of this work was published in DAC'96 [Cong and Hwang 1995] without the proofs of theorems and considered single-gate decompositions only.
PROBLEM FORMULATION

Definitions and Preliminaries
A combinational Boolean network $N$ can be represented by a directed acyclic graph $N = (V, E)$ where each node $v \in V$ represents a logic gate and each directed edge $(u, v) \in E$ represents a connection from the output of node $u$ to the input of node $v$. A node $v$ is a simple gate if $v$ implements one of the following functions: AND, OR, XOR, or their inversions. Primary inputs (PIs) are nodes of in-degree zero. Other nodes are internal, and some are designated as primary outputs (POs). A node $v$ is a predecessor of a node $u$ if there is a directed path from $v$ to $u$ in $N$. The depth of a node $v$ is the number of edges on the longest path from any PI to $v$. Each PI has a depth of zero. The depth of a network is the largest depth of any node in the network. Let $\mathit{input}(v)$ and $\mathit{fanout}(v)$ represent the set of fanins and the set of fanouts of node $v$, respectively. Given a subgraph $H$ of $N$, let $\mathit{input}(H)$ denote the set of distinct nodes outside $H$ that supply inputs to nodes in $H$. A fanin cone $C_v$ rooted at $v$ is a connected subnetwork consisting of $v$ and its predecessors. Node $v$ is the root node of $C_v$, and is denoted as $\mathit{root}(C_v) = v$.
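The definitions of node depth, network depth, and $\mathit{input}(H)$ can be made concrete with a small Python sketch. The dictionary-based network representation and the function names are assumptions made for this illustration only.

```python
def node_depths(fanins):
    """fanins maps every node to the list of its fanin nodes; PIs map to []."""
    depth = {}

    def visit(v):
        if v not in depth:
            # A PI has depth 0; any other node is one edge above its deepest fanin.
            depth[v] = 0 if not fanins[v] else 1 + max(visit(u) for u in fanins[v])
        return depth[v]

    for v in fanins:
        visit(v)
    return depth


def input_set(fanins, H):
    """Distinct nodes outside the subgraph H that supply inputs to nodes in H."""
    return {u for v in H for u in fanins[v]} - set(H)


# Tiny example network: g = f(a, b), h = f(g, c).
example = {"a": [], "b": [], "c": [], "g": ["a", "b"], "h": ["g", "c"]}
depths = node_depths(example)
network_depth = max(depths.values())
cone_inputs = input_set(example, {"g", "h"})
```

The recursive traversal is adequate for small examples; a large network would call for an iterative topological traversal instead.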
By implementing every subnetwork in $M$ using a K-LUT, we obtain a K-LUT network that is functionally equivalent to $N$. The mapping area and the mapping depth of $M$ are the LUT count (i.e., $|M|$) and the depth of the K-LUT network that implements $M$, respectively.
Given a K-bounded network $N$, let $S_K(N)$ represent the set of K-LUT networks that implement all mapping solutions of $N$. The minimum mapping depth of $N$, denoted $\mathit{MMD}(N)$, is the minimum network depth over all K-LUT networks in $S_K(N)$. Let $N_v$ represent the largest fanin cone rooted at $v$ in $N$. The minimum mapping depth of a node $v \in N$, denoted $\mathit{MMD}_N(v)$, is $\mathit{MMD}(N_v)$. The mapping depth of any PI is 0. Given a K-bounded network $N$, the FlowMap algorithm [Cong and Ding 1994a] computes $\mathit{MMD}_N(v)$ for every node $v \in N$ in polynomial time.
FlowMap computes a min-height K-feasible cut in the fanin cone of each node $v$ to obtain $\mathit{MMD}_N(v)$.
The following two lemmas are on the minimum mapping depth in general networks. Lemma 1 states the monotone property of minimum mapping depth and Lemma 2 gives a way to compute $\mathit{MMD}_N(v)$.

LEMMA 1 [Cong and Ding 1994a]. Let $N = (V, E)$ be a K-bounded network and let node $v \in V$. Then $\mathit{MMD}_N(u) \leq \mathit{MMD}_N(v)$ for every fanin $u \in \mathit{input}(v)$.

LEMMA 2 [Cong and Ding 1994a]. Let $N = (V, E)$ be a K-bounded network, node $v \in V$, and let $\max\{\mathit{MMD}_N(u) \mid u \in \mathit{input}(v)\} = p$. Then $\mathit{MMD}_N(v) = p$ if there exists a K-feasible cut of height $p - 1$ in $N_v$. Otherwise, $\mathit{MMD}_N(v) = p + 1$.
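The labeling rule of Lemma 2 can be sketched in Python as follows. The sketch tests for a K-feasible cut of height $p - 1$ with a unit-node-capacity max-flow computation: every node labeled $p$ is merged with $v$ into a sink, each remaining node is split into an "in" and an "out" half joined by an edge of capacity 1, and a cut exists exactly when the maximum flow does not exceed $K$. All function names and the dictionary-based network representation are assumptions of this illustration; it follows the idea of the flow-based labeling in FlowMap but is not the published implementation and makes no attempt at efficiency.

```python
from collections import deque

INF = float("inf")


def fanin_cone(fanins, v):
    """Set of nodes in the largest fanin cone rooted at v."""
    cone, stack = set(), [v]
    while stack:
        x = stack.pop()
        if x not in cone:
            cone.add(x)
            stack.extend(fanins[x])
    return cone


def has_k_feasible_cut(fanins, label, v, p, K):
    """True if N_v has a cut of at most K nodes with all labels below p on the input side."""
    if p <= 0:
        return False  # a cut of negative height cannot exist
    cone = fanin_cone(fanins, v)
    sink_side = {x for x in cone if x == v or label[x] >= p}
    cap = {}

    def add_edge(a, b, c):
        cap.setdefault(a, {})
        cap.setdefault(b, {})
        cap[a][b] = cap[a].get(b, 0) + c
        cap[b].setdefault(a, 0)  # residual edge

    for x in cone - sink_side:
        add_edge((x, "in"), (x, "out"), 1)  # each cut node costs one unit
        if not fanins[x]:
            add_edge("s", (x, "in"), INF)
    for x in cone:
        head = "t" if x in sink_side else (x, "in")
        for u in fanins[x]:
            if u not in sink_side:
                add_edge((u, "out"), head, INF)

    flow = 0
    while flow <= K:
        parent = {"s": None}
        queue = deque(["s"])
        while queue and "t" not in parent:  # breadth-first augmenting path search
            a = queue.popleft()
            for b, c in cap.get(a, {}).items():
                if c > 0 and b not in parent:
                    parent[b] = a
                    queue.append(b)
        if "t" not in parent:
            return True  # max-flow (= min node cut) is at most K
        b, bottleneck = "t", INF
        while parent[b] is not None:
            bottleneck = min(bottleneck, cap[parent[b]][b])
            b = parent[b]
        b = "t"
        while parent[b] is not None:
            a = parent[b]
            cap[a][b] -= bottleneck
            cap[b][a] += bottleneck
            b = a
        flow += bottleneck
    return False


def label_nodes(fanins, order, K):
    """Apply Lemma 2 to every node; order must list fanins before their fanouts."""
    label = {}
    for v in order:
        if not fanins[v]:
            label[v] = 0
            continue
        p = max(label[u] for u in fanins[v])
        label[v] = p if has_k_feasible_cut(fanins, label, v, p, K) else p + 1
    return label


# Chain of 2-input gates: g = f(a, b), h = f(g, c), r = f(h, d).
chain = {"a": [], "b": [], "c": [], "d": [], "g": ["a", "b"], "h": ["g", "c"], "r": ["h", "d"]}
order = ["a", "b", "c", "d", "g", "h", "r"]
labels_k3 = label_nodes(chain, order, 3)
labels_k2 = label_nodes(chain, order, 2)
```

For $K = 3$ the node h depends on three primary inputs and therefore keeps label 1, while r depends on four and receives label 2; for $K = 2$ every gate in the chain needs its own level.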
Properties of Structural Gate Decomposition
Simple gates allow arbitrary grouping of their fanins in decomposition. However, the grouping and the resulting gate size in decomposition can significantly affect the depth and area in the final mapping solution. In this section, we show that the best mapping results can only be obtained from completely decomposed networks.
Let node $v$ be a simple gate in a network $N$ and let $|\mathit{input}(v)| \geq 3$. Given a structural gate decomposition algorithm $D$, a decomposition step $D_v$ on node $v$ (i) chooses two fanins $u_1$ and $u_2$ of $v$; (ii) removes edges $(u_1, v)$ and $(u_2, v)$; and (iii) introduces a node $w$ and three edges $(u_1, w)$, $(u_2, w)$, and $(w, v)$ to reconnect $u_1$, $u_2$ and $v$. Because $v$ is a simple gate, $D_v$ can always be applied. Node $w$ has the same gate type as node $v$. For any subnetwork $N' = (V', E')$ of $N$ and a decomposition step $D_v$, we define
A network is completely decomposed when it becomes 2-bounded. In Figure 3 Gate Decomposition and LUT Mapping
PROOF. Let w be the node introduced by
Note that Theorem 1 and Corollary 1.1 hold as long as the decomposition step at v (structural, algebraic, or Boolean) can be carried out, regardless of whether v is a simple gate or not. However, the algebraic or functional decomposition for a complex gate may not always be possible.
Since the set of all possible functionally equivalent K-LUT networks expands whenever a simple gate is decomposed (Theorem 1), it is always beneficial to decompose simple-gate networks into 2-bounded networks for LUT mapping algorithms to exploit the larger mapping solution space. The experimental results reported in Cong and Ding [1994a] confirm this conclusion. In their experiments, the input networks were first transformed into simple gate networks and then decomposed structurally into 5-bounded, 4-bounded, 3-bounded, or 2-bounded networks before LUT mapping. The resulting mapping depth decreases monotonically along with the decrease of gate sizes in decomposition. An interesting contrast comes from the results reported in Legl et al. [1996a] , where networks were first collapsed completely and then decomposed functionally into 5-bounded, 4-bounded, or 3-bounded networks for LUT mapping. The best mapping solutions in terms of area and depth are mostly from the 5-bounded networks. The two experiments show an important difference between structural and functional decompositions: logic signals are preserved in structural decompositions, while new gates are synthesized during functional decompositions. In Legl et al. [1996a] , the 5-bounded, 4-bounded, and 3-bounded networks contain totally different sets of internal gates, which are synthesized independently in three functional decomposition processes. In fact, according to Corollary 1.1, if the 5-bounded networks in Legl et al. [1996b] were further decomposed before LUT mapping, even smaller mapping depth could be obtained in their experiments.
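The decomposition step defined earlier in this section, and its repeated application until a gate is 2-bounded, can be sketched in Python as follows. The dictionary-based representation, the function names, and the naming scheme for the introduced nodes are assumptions of this illustration. The choice of which two fanins to group is arbitrary here, whereas choosing them well is precisely the subject of this paper.

```python
def decompose_step(fanins, gate_type, v, u1, u2, w):
    """One step D_v: move fanins u1 and u2 of v onto a new node w that feeds v."""
    assert len(fanins[v]) >= 3
    assert u1 != u2 and u1 in fanins[v] and u2 in fanins[v]
    assert w not in fanins
    # (ii) remove edges (u1, v) and (u2, v); (iii) add (u1, w), (u2, w), (w, v).
    fanins[v] = [u for u in fanins[v] if u not in (u1, u2)] + [w]
    fanins[w] = [u1, u2]
    gate_type[w] = gate_type[v]  # w has the same gate type as v


def fully_decompose(fanins, gate_type, v):
    """Repeat the step, always grouping the first two fanins, until v has 2 fanins."""
    k = 0
    while len(fanins[v]) > 2:
        u1, u2 = fanins[v][0], fanins[v][1]
        decompose_step(fanins, gate_type, v, u1, u2, f"{v}_w{k}")
        k += 1


# A 4-input AND gate v becomes a tree of three 2-input AND gates.
net = {"v": ["a", "b", "c", "d"]}
types = {"v": "AND"}
fully_decompose(net, types, "v")
```

Each step adds one node, so a gate with $n$ fanins is completely decomposed after $n - 2$ steps.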
The following lemma specifies a condition where the structural gate decomposition will not cause further mapping depth reduction. 
Otherwise node $x_i$ would not be a depth-reduced node. We continue to trace depth-reduced nodes towards PIs. This tracing, however, cannot reach PIs since PIs have a depth of 0. At a certain depth, the second case must occur.

PROOF. Since the intermediate node $w$ has the same depth as node $v$, this lemma is true according to Lemma 3. □
Integrated versus Two-Step Technology Mapping
Gate decomposition and LUT mapping can be performed in two different ways. In an integrated mapping approach, the input network is decomposed and covered by LUTs simultaneously, while in a two-step mapping approach, the input network is decomposed into a K-bounded network before LUT mapping is performed. For example, Chortle-d is an integrated mapping approach, while FlowMap fits only into a two-step mapping approach. The separation of gate decomposition and LUT mapping is a restriction in general, since integrated approaches allow more informative gate decomposition and LUT mapping decisions, while two-step approaches do not have this advantage. It may appear that the minimum mapping depth for all integrated mapping approaches will be smaller than the minimum mapping depth for all two-step mapping approaches. However, we show that this is not the case for structural gate decomposition.
THEOREM 2. Given a K-bounded network N, if only structural gate decomposition is allowed, the minimum mapping depth for all integrated mapping approaches equals the minimum mapping depth for all two-step mapping approaches.
PROOF. Given an arbitrary K-bounded network $N$, assume some integrated approach results in the optimal depth $\mathit{MMD}(N)$ in a mapping solution $M_N$. Then $M_N$ is a mapping solution of some K-bounded network $N'$ decomposed structurally from $N$. A depth-optimal mapper (e.g., FlowMap) can take $N'$ as input and generate a mapping solution $M_{N'}$. Since $M_{N'}$ is depth-optimal with respect to $N'$, we have $\mathit{MMD}(N') \leq \mathit{MMD}(N)$. But $M_N$ is depth optimal with respect to $N$. As a result,
Our mapping algorithms, presented in Section 4, should be considered a hybrid approach. On one hand, depth minimization is achieved in structural gate decomposition (by DOGMA or DOGMA-m) to return a network topology of the minimum mapping depth; on the other hand, the LUT mapping solution is computed in depth-optimal LUT mapping with area minimization as a second objective. As a result, the depth and the area are optimized separately in the two steps of technology mapping.
The SGD/K and K-SGD/K Problems
In this paper we study structural gate decomposition of K-bounded or K-unbounded simple gate networks into 2-bounded networks such that LUT mapping algorithms (e.g., FlowMap) can obtain the smallest mapping depth. We formulate the following two problems.
Structural gate decomposition for K-LUT mapping (SGD/K). Given a simple-gate K-unbounded network $N_\infty$, decompose $N_\infty$ into a 2-bounded network $N_2$ such that $\mathit{MMD}(N_2) \leq \mathit{MMD}(N'_2)$ for any other 2-bounded decomposed network $N'_2$ of $N_\infty$.
Structural gate decomposition in K-bounded network for K-LUT mapping (K-SGD/K). Given a simple-gate K-bounded network $N_K$, decompose $N_K$ into a 2-bounded network $N_2$ such that $\mathit{MMD}(N_2) \leq \mathit{MMD}(N'_2)$ for any other 2-bounded decomposed network $N'_2$ of $N_K$.
COMPLEXITY OF SGD/K AND K-SGD/K PROBLEMS
We show the following results: (1) the SGD/K problem is NP-hard for $K \geq 3$; and (2) the K-SGD/K problem is NP-hard for $K \geq 5$. We present the construction for the NP-complete reduction, the lemmas and theorems, and the proofs for theorems. Proofs for lemmas can be found in the Appendix.
Our results are based on polynomial-time transformations from the 3SAT problem to the decision version of the SGD/K and the K-SGD/K problems. The 3SAT problem, which is a well-known NP-complete problem [Garey and Johnson 1979] , is defined as follows:
Problem: 3-Satisfiability (3SAT).
Instance: A set of Boolean variables $X = \{x_1, x_2, \ldots, x_n\}$ and a collection of $m$ clauses $C = \{C_1, C_2, \ldots, C_m\}$, where (i) each clause is the disjunction (OR) of 3 literals of the variables; and (ii) each clause contains at most one of $x_i$ and $\bar{x}_i$ for any variable $x_i$.
Question: Is there a truth assignment for the variables in X such that
We transform an arbitrary instance of 3SAT to an instance of SGD/K in polynomial time. The idea is to relate the truth assignment of variables in 3SAT to the decision of gate decomposition in SGD/K. Since determining the truth assignment is difficult, the decision of gate decomposition is also difficult. We define the decision version of the SGD/K problem as follows:
Problem: Structural gate decomposition for K-LUT mapping (SGD/K-D).
Instance: A constant $K \geq 3$, a depth bound $B$, and a simple gate K-unbounded network $N_\infty$.
Question: Is there a way to structurally decompose $N_\infty$ into a 2-bounded network $N_2$ such that the depth-optimal K-LUT mapping solution of $N_2$ has a depth no more than $B$?
Given an instance $F$ of 3SAT with $n$ variables $x_1, x_2, \ldots, x_n$ and $m$ clauses $C_1, C_2, \ldots, C_m$, we construct a K-unbounded network $N(F)$ corresponding to the instance $F$, as follows. First, for each variable $x_i$, we construct a subnetwork $N(x_i)$, which consists of the following nodes: (a) two output nodes denoted $x_i$ and $\bar{x}_i$; (b) $(2K^2 - 3K)$ PI nodes in which two of them are denoted $PI_i^1$ and
${}^{K-2}$, $w_i^1$, $w_i^2$ and $s_i$, respectively. The nodes are connected as shown in Figure 5. Each of the nodes $w_i^1$ and $w_i^2$ has $K - 1$ PI fanins. Node $s_i$ has 4 fanins from $w_i^1$, $w_i^2$, $PI_i^1$ and $PI_i^2$. Every other internal node has $K$ PI fanins. Note that $N(x_i)$ is well defined for $K \geq 3$ and is K-bounded for $K \geq 4$.
Next, for each clause $C_j$ with 3 literals $l_j^1, l_j^2, l_j^3$, we construct a subnetwork $N(C_j)$, which consists of the following nodes: (a) one output node denoted $C_j$; (b) three literal nodes denoted $l_j^1, l_j^2, l_j^3$; (c) $(2K - 5)$ internal nodes $q_j^1, \ldots, q_j^{2K-5}$, each the root of a complete 2-level K-ary tree with PI nodes as leaves; (d) $(K - 2)$ internal nodes $r_j^1, \ldots, r_j^{K-2}$, each the root of a complete 3-level K-ary tree with PI nodes as leaves. The connections are shown in Figure 6(a). The output node $C_j$ has all internal nodes as its fanins in $N(C_j)$. Note that $N(C_j)$ is well defined for $K \geq 3$. However, the output node $C_j$ is not K-bounded.
Finally, we connect the subnetworks $N(C_j)$, $j = 1, 2, \ldots, m$, with the subnetworks $N(x_i)$, $i = 1, 2, \ldots, n$, as follows, to form the network $N(F)$. Let literal $l_j^k$ be a literal in clause $C_j$. If $l_j^k = x_i$ where $x_i$ is a variable, we connect node $x_i$ in $N(x_i)$ as the single fanin of node $l_j^k$ in $N(C_j)$. Similarly, if $l_j^k = \bar{x}_i$, we connect node $\bar{x}_i$ in $N(x_i)$ as the single fanin of node $l_j^k$ in $N(C_j)$. Note that every literal node has exactly one fanin. This fanin node is called the variable node of the corresponding literal node. Network $N(F)$ has $m$ primary outputs: nodes $C_1, \ldots, C_m$. We illustrate the construction of $N(F)$ by an example. Assume $F = (x_1 + x_2 + x_3)(x_2 + x_3 + x_4)(x_1 + x_3 + x_4)$. The network $N(F)$ is shown in Figure 7. Because clause $C_1 = (x_1 + x_2 + x_3)$, we connect nodes $x_1, x_2, x_3$ as fanins to nodes $l_1^1, l_1^2, l_1^3$ in $N(C_1)$, respectively. Node $x_1$ is the variable node of node $l_1^1$. We have the following lemma.
in polynomial time, we can set $B = 4$ and solve 3SAT in polynomial time. Since 3SAT is NP-hard, the SGD/K-D problem is NP-hard. For a given decomposed network $D(N(F))$ of $N(F)$, it takes polynomial time to compute its mapping depth $d$ and verify whether $d \leq B$ (e.g., by FlowMap). As a result, the SGD/K-D problem is NP-complete. Since $N(x_i)$ and $N(C_j)$ are well defined for $K \geq 3$, the SGD/K-D problem is NP-complete for $K \geq 3$. Hence the SGD/K problem is NP-hard for $K \geq 3$. □
We now show the complexity of the K-SGD/K problem. In this construction of the reduction, every node must be K-bounded (note that $N(C_j)$ is not K-bounded in the previous construction). Given an instance $F$ of 3SAT with $n$ variables $x_1, x_2, \ldots,$
construct a corresponding K-bounded network $N_K(F)$, as follows. For each variable $x_i$, construct subnetwork $N(x_i)$ as before (shown in Figure 5). However, for each clause $C_j$, construct subnetwork $N_K(C_j)$ consisting of (a) one output node denoted $C_j$; (b) three literal nodes denoted $l_j^1, l_j^2, l_j^3$; (c) $(K - 5)$ internal nodes $q_j^1, \ldots, q_j^{K-5}$, each the root of a complete 2-level K-ary tree with PI nodes as leaves. The subnetwork $N_K(C_j)$ is shown in Figure 8(a). Note that $N_K(C_j)$ is well defined and K-bounded for $K \geq 5$. We connect subnetworks $N(x_i)$ and $N_K(C_j)$ according to the formula $F$ as before, to obtain the network $N_K(F)$. We have the following lemma.
LEMMA 6. The 3SAT instance F is satisfiable if and only if N
Based on arguments similar to those in the proof of Theorem 3, it is easy to see that the K-SGD/K problem is NP-hard for $K \geq 5$. □
GATE DECOMPOSITION ALGORITHMS FOR DEPTH-OPTIMAL MAPPING
In this section we combine the node-packing technique in Chortle-d with the min-height K-feasible cut technique in FlowMap in structural gate decomposition of simple-gate networks. Our objective is to minimize the depth in the final mapping solution. We propose two algorithms. The first algorithm decomposes logic gates independently, as in most previous approaches, while the second algorithm decomposes multiple gates simultaneously to exploit common fanins. The advantage of multigate decomposition can be seen in one example. Nodes a, b, ..., f in Figure 9 are primary inputs. If nodes u and v in Figure 9(a) are decomposed independently, we might obtain a network in Figure 9(b). For $K = 3$, the best
Single Gate Decomposition
We present our single gate decomposition algorithm DOGMA (Depth-Optimal Gate decomposition for MApping) in this section. Given a simple gate network $N$, DOGMA decomposes nodes in topological order from PIs to POs. At each node $v$, DOGMA decomposes and labels $v$ with the number $l(v) = \mathit{MMD}_{N(v)}(v)$ where $N(v)$ denotes the decomposed network. The set of fanins of label $q$ in $\mathit{input}(v)$, denoted $S_q$, is called a stratum of depth $q$. A K-feasible cut of height $q - 1$ exists for every node in $S_q$. A K-feasible cut of height $q - 1$ exists for a set $B$ of nodes if such a cut exists for a node $s$ created with $\mathit{input}(s) = B$. DOGMA groups $\mathit{input}(v)$ into strata according to their labels, and processes each stratum in two steps.
(1) Starting from stratum $S_q$ of the smallest depth, DOGMA partitions $S_q$ into a minimum number of subsets such that there exists a K-feasible cut of height $q - 1$ for each subset of nodes. The process is similar to packing objects into bins. Each bin has a size of $K$. The size of a node (also called an object) is the size of its min-cut of height $q - 1$. A set of nodes can be packed into one bin if their overall size is no larger than $K$. Such a bin is called a min-height K-feasible bin, which corresponds to a partitioned subset of $S_q$. Note that the overall cut size for nodes in a set could be smaller than the sum of their individual cut sizes.

created for each $w_i$ with $\mathit{input}(b_i) = \{w_i\}$ and a label $l(b_i) = q + 1$. All buffer nodes are put into the set $S_{q+1}$. Note that if some bin $B_i$ contains more than 2 nodes, bin node $w_i$ needs to be further decomposed. However, according to Lemma 4, no matter how $w_i$ is decomposed, the minimum mapping depth of the network does not change. DOGMA arbitrarily decomposes $w_i$ into an unbalanced tree.
DOGMA repeats steps (1) and (2) for stratum $S_{q+1}$, and so on, until all strata have been processed. The last bin node corresponds to node $v$. Note that buffer nodes are introduced only for the packing process, and will be removed when the decomposition is complete.
To determine if there exists a K-feasible cut of height $q - 1$ for a bin $B_i \subseteq S_q$ of nodes, we compute a max-flow in the flow network, constructed as follows [Cong and Ding 1994a]

We illustrate DOGMA for $K = 3$. The output node $v$ in Figure 10(a) is under decomposition. Among the five fanins of $v$, nodes $b, c, d$ have labels $l(b) = l(c) = l(d) = 2$ and nodes $a, e$ have labels $l(a) = l(e) = 3$. As a result, $S_2 = \{b, c, d\}$ and $S_3 = \{a, e\}$. According to DOGMA, $b$ and $c$ will be packed into one bin, since a K-feasible cut of height 1 exists for them, and $d$ into another bin for a total of two (which is the minimum) min-height K-feasible bins. Then bin nodes $f$ and $g$ with labels $l(f) = l(g) = 2$ and buffer nodes $h$ and $i$ with labels $l(h) = l(i) = 3$ are created for the two bins, respectively (see Figure 10(b)). DOGMA proceeds to the stratum of depth 3. Two K-feasible cuts of height 2 are found for $\{a, h\}$ and $\{i, e\}$, respectively. Again, bin nodes $j$ and $k$ with labels $l(j) = l(k) = 3$ and buffer nodes $m$ and $n$ with labels $l(m) = l(n) = 4$ are created for the two bins, respectively. Nodes $m$ and $n$ are then packed into a bin that corresponds to $v$ (see Figure 10(c)). Finally, nodes $g$, $h$, $i$, $m$ and $n$ are removed and node $v$ is completely decomposed with a label $l(v) = 4$.
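The grouping of fanins into strata in this example can be reproduced with a few lines of Python. The function name `strata` and the dictionary of labels are hypothetical; only the labels stated in the example are used.

```python
from collections import defaultdict


def strata(fanin_labels):
    """Group the fanins of one node by label; returns {depth q: stratum S_q}."""
    groups = defaultdict(set)
    for node, q in fanin_labels.items():
        groups[q].add(node)
    # Sort by depth so that strata are processed from the smallest depth upward.
    return dict(sorted(groups.items()))


# Labels of the five fanins of v in the example above.
labels_of_fanins = {"a": 3, "b": 2, "c": 2, "d": 2, "e": 3}
S = strata(labels_of_fanins)
processing_order = list(S)
```

DOGMA would then pack `S[2]` first, and the buffer nodes it creates join `S[3]` before that stratum is packed in turn.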
The following problem has to be solved in DOGMA.
Min-height K-feasible bin-packing problem. Given a stratum $S_q$ of depth $q$, pack nodes in $S_q$ into a minimum number of min-height K-feasible bins.
In our study we developed three heuristics to solve the problem. The first-fit-decreasing (FFD) and best-fit-decreasing (BFD) are two heuristics for the bin-packing problem [Horowitz and Sahni 1978]. The FFD heuristic sorts objects into a list of objects of decreasing sizes, indexes the bins 1, 2, 3, ..., then removes the objects from the list (in order) and puts each into the first bin that can accommodate it. The initial conditions on the bins and objects in the BFD heuristic are the same as in the FFD heuristic. But BFD puts the object into the bin that leaves the smallest empty space. For the min-height K-feasible bin-packing problem, we proposed two min-cut-based heuristics, MC-FFD and MC-BFD, which are analogous to FFD and BFD, except that every object is a node whose size is defined to be the size of its min-cut of height $q - 1$. A set of nodes can be packed into a K-feasible bin as long as their combined cut size is no larger than $K$. The third heuristic is called maximal-sharing-decreasing (MC-MSD), which encourages sharing during packing, i.e., the size of the min-cut for the packed nodes is smaller than the sum of their individual min-cut sizes. The packing that produces the maximum sharing is considered the best-fit packing when MC-MSD calls MC-BFD for a packing result.
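The classical FFD and BFD heuristics described above can be sketched in Python as follows. The sketch uses plain additive object sizes; the MC-FFD and MC-BFD variants would instead evaluate the min-cut size of the combined node set, which can be smaller than the sum of the individual sizes, and that part is not modeled here. The function names are hypothetical.

```python
def first_fit_decreasing(sizes, capacity):
    """Place each object, largest first, into the first bin that still has room."""
    bins = []
    for size in sorted(sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= capacity:
                b.append(size)
                break
        else:
            bins.append([size])  # no existing bin fits: open a new one
    return bins


def best_fit_decreasing(sizes, capacity):
    """Place each object, largest first, into the bin left with the least free space."""
    bins = []
    for size in sorted(sizes, reverse=True):
        fitting = [b for b in bins if sum(b) + size <= capacity]
        if fitting:
            tightest = min(fitting, key=lambda b: capacity - sum(b) - size)
            tightest.append(size)
        else:
            bins.append([size])
    return bins


# Six objects packed into bins of size K = 5.
object_sizes = [2, 3, 1, 2, 4, 3]
ffd_bins = first_fit_decreasing(object_sizes, 5)
bfd_bins = best_fit_decreasing(object_sizes, 5)
```

On this small instance both heuristics open three bins, which is consistent with the observation reported below that the heuristics rarely differ when the bin size is small.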
Experimental results (Table I) show very few differences in mapping results among the three heuristics (DOGMA followed by CutMap) for MCNC benchmarks. This indicates that in most cases the same number of bins were obtained by the three heuristics. This could be due to the small bin size ($K = 5$) in the experiment. We chose MC-FFD for its efficiency. The FFD heuristic is also used in Chortle-d for packing nodes into bins. However, MC-FFD packs nodes according to the size of their min-height K-feasible cut for better performance. With reconvergent fanouts in general networks, one cannot decide locally whether a set of nodes can be packed into one bin or not. For example, it is not obvious that nodes $e$ and $i$ in Figure 10(b) can be packed into one bin. The MC-FFD heuristic employs max-flow computation and can decide the packing feasibility correctly.

The time complexity of DOGMA is computed as follows: For every node $v$ in the input network $N = (V, E)$, structural gate decomposition will create $|\mathit{input}(v)| - 2$ nodes. In total, there are $\sum_{v \in V} (|\mathit{input}(v)| - 2) = |E| - 2 \cdot |V| = O(|E|)$ nodes created. The min-height K-feasible cut computation has a time complexity of $O(K \cdot |E|)$ [Cong and Ding 1994a] where $K$ is the LUT input size, and is carried out $O(|\mathit{input}(v)|^2)$ times in the worst case at each node $v$ in the MC-FFD heuristic. Let $d_{max}$ be the maximal fanin size for nodes in $N$. Then the time complexity of DOGMA is $O(K \cdot d_{max}^2 \cdot |E|^2)$. We can reduce the time complexity of min-height cut computation to $O(K \cdot |E_p|)$ by constructing partial flow networks only to a certain depth, where $E_p$ is the edge set of the partial flow network. Let $E_{p-max}$ represent the edge set of the largest partial flow network constructed during decomposition. Then the time complexity of DOGMA is reduced to $O(K \cdot d_{max}^2 \cdot |E_{p-max}| \cdot |E|)$.
Multiple Gate Decomposition
We present our multiple gate decomposition algorithm, called DOGMA-m, and illustrate the procedure on the network shown in Figure 11(a) for $K = 3$. DOGMA-m is outlined in Figure 12.
We call the stratum of each node a local stratum. The union of all local strata of depth q is called the global stratum of depth q. For each depth q, a node v is under decomposition if Խinput͑v͒Խ Ͼ 2 (i.e., not yet completely decomposed) and input͑v͒ intersets with the global stratum of depth q. Starting from depth q ϭ 1 and up, the nodes of the same gate type and also under decomposition will be decomposed simultaneously. In Figure 11(a),  nodes a, b, . . . , h all have a label of 1. Nodes x, y, and z are under decomposition for q ϭ 1. The local stratum of depth 1 is ͕a, b, c͖ for node x, ͕b, c, d, e, f͖ for node y, and ͕e, f, g, h͖ for node z, respectively. The global stratum of depth 1 is ͕a, b, c, d, e, f, g, h͖.
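As a small illustration of these definitions, the sketch below computes local and global strata for the network of Figure 11(a). It assumes that the local stratum of depth q of a node is the set of its fanins labeled q, treats all gates as one gate type, and uses hypothetical helper names (`local_stratum`, `global_stratum`) and a dictionary-based network representation that are not part of the paper.

```python
def local_stratum(fanins, labels, v, q):
    """Fanins of v whose label equals q (assumed definition of a stratum)."""
    return {u for u in fanins[v] if labels.get(u) == q}


def global_stratum(fanins, labels, q):
    """Return the nodes under decomposition at depth q and the union of
    their local strata of depth q."""
    under = {v for v in fanins
             if len(fanins[v]) > 2 and local_stratum(fanins, labels, v, q)}
    stratum = set()
    for v in under:
        stratum |= local_stratum(fanins, labels, v, q)
    return under, stratum


# Figure 11(a): PI buffers a..h carry label 1; x, y, z, v are not labeled yet.
fanins = {"x": ["a", "b", "c"], "y": ["b", "c", "d", "e", "f"],
          "z": ["e", "f", "g", "h"], "v": ["x", "y", "z"]}
labels = {u: 1 for u in "abcdefgh"}
under, S1 = global_stratum(fanins, labels, q=1)
```

In this example node v is not yet under decomposition for depth 1, because none of its fanins has been labeled.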
In initialization, buffers are created for PIs to supply inputs to the rest of the network. PIs are labeled 0 and buffers are labeled 1. In Figure 11 (a), nodes a, b, . . . , h are PI buffers. Gray regions represent the global strata of depth 1 and 2 in Figure 11 (a)-(c) and (d), respectively. The gate decomposition proceeds as follows:
(1) For each depth q and for each gate type f, the nodes under decomposition are collected into a set $G_q^f$. Then the global stratum of depth q, denoted $S_q$, is computed as the union of the local strata of depth q for all nodes in $G_q^f$. In Figure 11(a), with $f = \mathrm{AND}$, we have $G_1^f = \{x, y, z\}$ and $S_1 = \{a, b, c, d, e, f, g, h\}$. Based on $G_q^f$ and $S_q$, we formulate the Global Stratum Bin-Packing (GSBP) problem (to be formally defined later). By solving the GSBP problem, we achieve that (i) for each node in $G_q^f$, its local stratum of depth q is packed into min-height K-feasible bins, and (ii) the total number of min-height K-feasible bins is minimized. The second objective is achieved by packing common fanins for the nodes in $G_q^f$. Intermediate nodes (also called bin nodes) are created for bins. In Figure 11(b), nodes b and c, e and f, and g and h are packed into bin nodes i, j, and k, respectively.
(2) It is possible that some nodes in $G_q^f$ have been decomposed completely (e.g., nodes x and z in Figure 11(b)), while the local strata of other nodes can be packed further (e.g., node y in Figure 11(b)). Both $G_q^f$ and $S_q$ are updated, and a new instance of the GSBP problem for the same q value is formulated and solved. The process iterates until the global stratum of depth q has been minimally packed into bins (that is, until the network no longer changes). In Figure 11(b), we have $l(x) = 1$, $l(z) = 2$, $G_1^f = \{v, y\}$, and $S_1 = \{i, d, j, x\}$. By solving the GSBP problem for the updated $G_q^f$ and $S_1$, nodes d and i are packed into a bin node m. Node y is now completely decomposed with a label $l(y) = 2$. The process iterates with updated $G_q^f = \{v\}$ and $S_1 = \{x\}$, but no further packing is possible for $q = 1$ (see Figure 11(c)).
(3) Buffer nodes are created and labeled $q + 1$ for every fanin in the global stratum $S_q$. The decomposition process iterates steps (1) and (2) until the network is 2-bounded. In Figure 11(d), a buffer node n is created for node x, nodes y and z are then packed into a bin, and the decomposition of node v is completed.
Two points are worth mentioning. First, in DOGMA, each node is decomposed only after all its fanins have been decomposed and labeled. In DOGMA-m, however, nodes can undergo decomposition even though some of their fanins have not been labeled. For example, node v in Figure 11 is under decomposition ($v \in G_1^f$), while its fanin y is not labeled yet. Second, for each depth q and gate type f, multiple instances of the GSBP problem might be solved in order to pack local strata into a minimal number of bins. For example, two instances of the GSBP problem are solved for $q = 1$ before the local stratum of node y is minimally packed (from Figure 11(a) to (c)). In our experiments, we found that solving three instances of the GSBP problem is sufficient for each q value.
The Global Stratum Bin-Packing (GSBP) problem is formally defined as follows.
Global stratum bin-packing (GSBP) problem. Given a set $G_q^f$ of nodes of gate type f under decomposition and a global stratum $S_q$ of depth q that contains the fanins of nodes from $G_q^f$, pack the fanins in $S_q$ into a set of bins such that (i) for each node in $G_q^f$, its local stratum of depth q is packed into min-height K-feasible bins; and (ii) the total number of min-height K-feasible bins is minimized.
To solve the GSBP problem, we build a matrix M in which rows correspond to nodes in $G_q^f = \{v_1, v_2, \ldots, v_n\}$, columns correspond to fanins in $S_q = \{u_1, u_2, \ldots, u_m\}$, and an entry $M(i, j) = 1$ if $u_j \in input(v_i)$ and $M(i, j) = 0$ otherwise. A rectangle is a subset of rows and columns, denoted by a pair $(R, C)$ indicating the row and column subsets, in which all entries are 1. C corresponds to a bin of fanins and R corresponds to a set of nodes that share the fanins in C. A solution of the GSBP problem is a rectangle cover for M, subject to the constraint that a K-feasible cut of height $q - 1$ exists for the fanins in each column set C. This matrix representation is similar to the cube-literal matrix used for solving the cube-extraction problem [Rudell 1989; De Micheli 1994]. However, the algorithms for cube extraction cannot be applied directly because the C in every rectangle $(R, C)$ must satisfy the K-feasible cut constraint.
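The matrix formulation can be made concrete with a short sketch. It builds M for the sets $G_1^f$ and $S_1$ of Figure 11(a) and tests whether a pair (R, C) is a rectangle. The helper names `build_matrix` and `is_rectangle` are hypothetical, and the K-feasible cut constraint on C is deliberately not checked here.

```python
def build_matrix(G, S, fanins):
    """M[i][j] = 1 if S[j] is a fanin of G[i], and 0 otherwise."""
    return [[1 if u in fanins[v] else 0 for u in S] for v in G]


def is_rectangle(M, G, S, R, C):
    """True if every entry of M in the rows R and columns C is 1."""
    return all(M[G.index(v)][S.index(u)] == 1 for v in R for u in C)


# Rows and columns for Figure 11(a), depth q = 1, gate type AND.
G = ["x", "y", "z"]
S = list("abcdefgh")
fanins = {"x": set("abc"), "y": set("bcdef"), "z": set("efgh")}
M = build_matrix(G, S, fanins)

shared_bc = is_rectangle(M, G, S, ["x", "y"], ["b", "c"])        # b, c feed both x and y
shared_bce = is_rectangle(M, G, S, ["x", "y"], ["b", "c", "e"])  # e does not feed x
```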
We use the MC-FFD packing heuristic to compute a rectangle cover for the GSBP problem as follows. First, compute the fanout factor $o_j = \sum_{i=1}^{n} M(i, j)$ and the cut size $s_j$ of the min-cut of height $q - 1$ for every fanin $u_j \in S_q$. The weight of each fanin is $o_j \cdot s_j$. Then we sort the fanins according to their weights and follow the MC-FFD bin-packing heuristic to pack fanins into bins (starting from the fanin with the largest weight). Our strategy is to group fanins of large cut sizes to obtain a minimum number of bins and to group fanins of large fanout sizes to exploit common fanins. A set of fanins can be packed into one bin C if (i) a K-feasible cut of height $q - 1$ exists for the fanins in C, and (ii) the largest rectangle $(R, C)$ satisfies $|R| \ge r_{min}$ (i.e., at least $r_{min}$ nodes in $G_q^f$ share these fanins), where $r_{min}$ is a user-specified parameter. By performing the MC-FFD packing heuristic, we obtain a set of rectangles. Each rectangle $(R, C)$ that satisfies $|C| \ge c_{min}$ (another user-specified parameter) is saved and covered with 0's in M. The MC-FFD packing procedure is repeated until M contains only 0's. A rectangle cover for M is then obtained, and the set C in each rectangle corresponds to a bin. In our implementation, we set $r_{min} = 2$ and $c_{min} = 2$ in the first pass of the MC-FFD packing procedure, and decrease both values to 1 in subsequent iterations. Decreasing these values guarantees the termination of our procedure.
We demonstrate the MC-FFD packing heuristic for solving the GSBP problem on the network in Figure 11(a) for $K = 3$. The initial matrix M is shown in Figure 13(a). The rows correspond to nodes in $G_1^f = \{x, y, z\}$ and the columns correspond to fanins in $S_1 = \{a, b, c, d, e, f, g, h\}$. The weight of each fanin is its fanout size (i.e., the number of 1's in each column), since every fanin is a PI buffer whose cut size is 1. Fanins are sorted into the order b, c, e, f, a, d, g, h according to their weights. Nodes b and c are packed into the first bin, which corresponds to the rectangle $(R_1, C_1) = (\{x, y\}, \{b, c\})$. Although there is a 3-feasible cut of height 0 for nodes b, c, e, they cannot be packed into one bin because the rectangle for them has $|R| = |\{y\}| < r_{min} = 2$. As a result, node e is put into a separate bin and packed with node f, which corresponds to the rectangle $(R_2, C_2) = (\{y, z\}, \{e, f\})$. Then the two rectangles are covered with 0's (Figure 13(b)). We reset $r_{min} = c_{min} = 1$ and perform another run of the MC-FFD packing heuristic. Three bins are obtained, but only one bin contains two fanins. In total, three bin nodes are created. The network in Figure 11(a) is now decomposed into the network in Figure 11(b).
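The walk-through above can be reproduced with a simplified sketch of the two-pass packing procedure. It assumes that every fanin is a PI buffer with cut size 1 and a distinct PI, so the K-feasible cut test reduces to $|C| \le K$; the real heuristic uses a max-flow computation instead. The function names `rows_sharing`, `pack_pass`, and `gsbp_cover` are hypothetical and not taken from the authors' implementation.

```python
def rows_sharing(M, cols):
    """Row indices whose entries are 1 in every column index of cols."""
    return [i for i, row in enumerate(M) if all(row[j] == 1 for j in cols)]


def pack_pass(M, K, r_min, c_min):
    """One first-fit pass over M (modified in place); returns saved rectangles."""
    n_cols = len(M[0])
    # Weight o_j * s_j with s_j = 1 (assumption: every fanin is a PI buffer).
    weight = [sum(row[j] for row in M) for j in range(n_cols)]
    order = sorted((j for j in range(n_cols) if weight[j] > 0),
                   key=lambda j: -weight[j])  # stable: ties keep column order
    bins = []
    for j in order:
        for b in bins:
            # Assumed feasibility test: |C| <= K stands in for the cut check.
            if len(b) < K and len(rows_sharing(M, b + [j])) >= r_min:
                b.append(j)
                break
        else:
            bins.append([j])
    saved = []
    for b in bins:
        rows = rows_sharing(M, b)
        if len(b) >= c_min and len(rows) >= r_min:
            saved.append((rows, b))
            for i in rows:
                for j in b:
                    M[i][j] = 0  # cover the rectangle with 0's
    return saved


def gsbp_cover(M, K):
    """First pass with r_min = c_min = 2, then passes with 1 until M is all 0."""
    M = [row[:] for row in M]  # work on a copy
    cover = pack_pass(M, K, r_min=2, c_min=2)
    while any(any(row) for row in M):
        cover += pack_pass(M, K, r_min=1, c_min=1)
    return cover


# Matrix for Figure 11(a): rows x, y, z; columns a..h.
G = ["x", "y", "z"]
S = list("abcdefgh")
M = [[1, 1, 1, 0, 0, 0, 0, 0],
     [0, 1, 1, 1, 1, 1, 0, 0],
     [0, 0, 0, 0, 1, 1, 1, 1]]
cover = [([G[i] for i in R], [S[j] for j in C]) for R, C in gsbp_cover(M, K=3)]
bin_nodes = [C for _, C in cover if len(C) >= 2]
```

Under these assumptions the first pass saves the rectangles for {b, c} and {e, f}, and the second pass yields the bins {a}, {d}, and {g, h}, so three bins hold at least two fanins, in line with the three bin nodes of Figure 11(b).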
EXPERIMENTAL RESULTS
We implemented DOGMA and DOGMA-m in the C language and incorporated them into the RASP logic synthesis system for FPGAs. We prepared two sets of benchmarks for our experiments. The first set, $C_{original}$, consists of 24 original multilevel MCNC benchmarks, all of which contain a large percentage of 2-unbounded gates (i.e., gates with 3 or more inputs). We ran the rugged script in SIS [Sentovich et al. 1992] for technology-independent optimization and obtained the second set of benchmarks, $C_{rugged}$. Both sets of benchmarks were transformed into simple gate networks using AND-OR decomposition. Table II shows the circuit sizes and fanin distributions of the two sets of simple gate networks. The benchmark set $C_{original}$ contains 18,824 simple gates with 42% of them being 2-unbounded, while the benchmark set $C_{rugged}$ contains 13,007 simple gates with 28% of them 2-unbounded. Clearly, both circuit size and fanin size were reduced by performing the rugged script. The total runtime is less than six minutes on ULTRA2.
We compared DOGMA and DOGMA-m with three structural gate decomposition algorithms, and also compared DOGMA-m with algebraic and Boolean decomposition approaches. The three structural gate decomposition algorithms used for comparison are the tech_decomp algorithm [Sentovich et al. 1992]; the dmig algorithm [Wang 1989; Chen et al. 1992]; and our implementation of the Chortle-d algorithm [Francis et al. 1991b]. After gate decomposition by each of these algorithms, CutMap [Cong and Hwang 1995] was employed to obtain depth-optimal mapping solutions. For a comparison across structural, algebraic, and Boolean gate decompositions, we employed DOGMA-m, speed_up in SIS [Sentovich 1992], and the TOS package [Legl et al. 1996a] to perform decompositions, respectively. Again, CutMap was employed to perform LUT mapping, except for TOS, since it produced LUT networks directly. The objective of gate decomposition and LUT mapping in our experiments was to minimize mapping depth. CutMap also minimizes mapping area as the second objective. All experiments were performed on a Sun ULTRA2 workstation with 256 MB of memory.
We first demonstrate the impact of further gate decomposition on depth and area in technology mapping. According to Theorem 1, the mapping solution space expands regardless of the gate decomposition algorithm used. We use tech_decomp to decompose benchmarks in $C_{rugged}$ into 5-bounded networks, and subsequently into 2-bounded networks, followed by LUT mapping to obtain mapping solutions. The sizes of the 5-bounded networks increase substantially compared to the 5-unbounded networks in $C_{rugged}$. However, the percentages of 2-unbounded gates are about the same. We employed CutMap [Cong and Hwang 1995] and DFMap [Cong and Ding 1994b] to produce depth-optimal and duplication-free area-optimal mapping solutions, respectively. In Table III, we see that both the optimal mapping depth (by CutMap) and the optimal duplication-free mapping area (by DFMap) are reduced by 16% when the 5-bounded networks are further decomposed into 2-bounded networks.
Next, we compared five structural gate decomposition algorithms (tech_decomp, dmig, Chortle-d, DOGMA, and DOGMA-m) on benchmarks in $C_{original}$ and $C_{rugged}$, using CutMap as the mapping engine. The depth and area of the mapping solutions, as well as the runtimes of the compared algorithms (not including CutMap time), for the two sets of benchmarks are presented in Tables IV and V. In comparison to DOGMA-m, the other four algorithms result in up to 11% larger mapping depth and 50% larger mapping area on the benchmark set $C_{original}$, and up to 16% larger mapping depth and 10% larger mapping area on the benchmark set $C_{rugged}$. The differences in mapping depth obtained by DOGMA-m, dmig, and DOGMA are marginal, while the differences in mapping area are more significant. Regarding runtime, DOGMA-m is comparable to DOGMA, but is 8 to 33 times slower than the other three algorithms. However, the DOGMA-m runtime is of the same order of magnitude as the time spent in performing the rugged script or CutMap.
Comparing Tables IV and V, we see that the mapping area for $C_{rugged}$ is 30% to 50% smaller, while the mapping depth for $C_{rugged}$ is 1% to 7% larger than that for $C_{original}$. This shows that the rugged script, which performs logic optimization based on algebraic divisions, is very effective for area minimization, but not as effective for depth minimization. A benefit resulting from the area reduction is the significant decrease of runtime for all decomposition algorithms. For benchmarks in $C_{rugged}$, DOGMA-m results in 10% smaller area compared to the other four algorithms. This shows that DOGMA-m can exploit common fanins for area minimization, in addition to the area reduction achieved by the rugged script.
Finally, we employed DOGMA-m, speed_up, and TOS for a comparison across structural, algebraic, and Boolean gate decomposition approaches. We configured TOS for delay-oriented synthesis in the medium-effort mode, performing both single-output (TOS-s) and multioutput (TOS-m) functional decompositions. The input circuits to TOS were prepared as follows. We first tried to collapse each benchmark in $C_{rugged}$ into a flat logic network within 30 minutes of CPU time. If this could not be done, we used the reduce_depth -depth d command in TOS to collapse benchmarks into networks of the smallest depth d where $d \ge 2$. We allocated 30 minutes of CPU time for each depth d, starting from $d = 2$ and up. Among all benchmarks after collapsing, rot and C880 have a depth of 2; C432, C2670, C5315, and C7552 have a depth of 3; 3540 and i10 have a depth of 4; and C6288 has a depth of 6. The remaining benchmarks are completely collapsed.
Table VI collects the mapping results from DOGMA-m + CutMap, speed_up + CutMap, TOS-s, and TOS-m. Subtotal 1, subtotal 2, and subtotal 3 are totals of the mapping results for the benchmarks that speed_up, TOS-s, and TOS-m finish, respectively, and the ratios measure the relative performance of these approaches with respect to DOGMA-m + CutMap. Time T(s) reports the computation time in seconds. In Table VI, we see that DOGMA-m + CutMap is able to map all benchmarks in 23 minutes, while speed_up + CutMap, TOS-s, and TOS-m fail to map some benchmarks in 2 hours. Compared to DOGMA-m + CutMap, speed_up + CutMap takes more than 5 hours (98% consumed by speed_up) to map 23 benchmarks (not including des), but obtains significantly better results: 13% smaller mapping depth and 6% smaller mapping area. The results for C432 show the largest contrast between the performance of speed_up and the efficiency of DOGMA-m: speed_up results in a mapping depth of 8 in more than 2 hours, while DOGMA-m results in a mapping depth of 11 in 6.6 seconds. TOS-s and TOS-m do not return mapping solutions in the allocated CPU times for 3 and 8 benchmarks, respectively. Compared to the other two approaches, TOS-s obtains a smaller mapping depth on count, 9symml, alu2, alu4, and t481. TOS-m obtains a smaller mapping area on 9symml, cordic, x1, alu2, and t481. It is worth noting that TOS is extremely successful for 9symml and t481. The results indicate that mapping approaches based on functional decomposition require longer computation time to obtain good results, especially on circuits of medium to large sizes.
Overall, from these experiments, we conclude that DOGMA-m obtains the best mapping results among the five structural gate decomposition algorithms, and is much more efficient in terms of runtime (over 17 and 30 times faster) than the algebraic decomposition algorithm speed_up and the functional decomposition approach TOS. However, speed_up obtains the best results among the approaches.
CONCLUSION
In this paper we presented an in-depth study of structural gate decomposition for depth-optimal technology mapping in LUT-based FPGA designs. We showed that any structural gate decomposition in K-bounded networks can only result in a smaller or equal depth in K-LUT mapping solutions, regardless of the decomposition algorithm used. Therefore, it is always beneficial to decompose circuits into 2-bounded networks for depth minimization when structural decompositions are applied. We proved that the structural gate decomposition problem in depth-optimal technology mapping is NP-hard for K-unbounded networks when the LUT input size $K \ge 3$, and remains NP-hard for K-bounded networks when $K \ge 5$. We proposed two new algorithms, called DOGMA and DOGMA-m, which combine the level-driven node-packing technique in Chortle-d and the network flow-based labeling technique in FlowMap, for structural gate decomposition. DOGMA-m decomposes multiple gates simultaneously to exploit common fanins. The following experimental results were observed. First, the optimal mapping depth and the optimal duplication-free mapping area can be reduced when 5-bounded networks are decomposed structurally into 2-bounded networks. Second, applying the rugged script for technology-independent logic optimization before technology mapping can result in a 40% to 50% reduction in mapping area, with only a marginal increase in depth, while significantly reducing the runtime of the structural decomposition algorithms. Third, DOGMA-m results in the smallest mapping depth and mapping area among the five structural gate decomposition algorithms. Finally, comparing the three algorithms DOGMA-m, speed_up, and TOS, which take, respectively, structural, algebraic, and Boolean (functional decomposition) gate decomposition approaches, DOGMA-m can decompose all tested benchmarks in a short time, while speed_up and TOS fail to obtain results on some benchmarks.
However, speed_up yields final mapping solutions with 13% smaller depth and 6% smaller area than DOGMA-m.
APPENDIX: Proofs
We prove Lemma 5 and Lemma 6 here. In Figure 5, every internal node in $N(x_i)$ has a mapping depth of 1, except node $s_i$, which has a mapping depth of 2. The mapping depths of nodes $x_i$ and $\bar{x}_i$ depend on the way node $s_i$ is decomposed. In Figure 6, nodes $q_j^k$ ($1 \le k \le 2K - 5$) in $N(C_j)$ have a mapping depth of 2, since each of them is the root of a complete 2-level K-ary tree with PI nodes as leaves. Similarly, nodes $r_j^k$ ($1 \le k \le K - 2$) have a mapping depth of 3. The mapping depth of node $C_j$ in $N(C_j)$ depends on the mapping depths of its three literal nodes $l_j^1$, $l_j^2$, $l_j^3$. Since every literal node has a single fanin from its variable node, the mapping depths of nodes $C_j$ ($1 \le j \le m$) depend on how nodes $s_i$ ($1 \le i \le n$) are decomposed. The network $N(F)$ has m primary outputs. Let $D(N(F))$ be a decomposed network of $N(F)$. Then $MMD(D(N(F)))$, the mapping depth of $D(N(F))$, depends on how the node $s_i$ is decomposed in every $N(x_i)$.
PROOF. Node $s_i$ can be decomposed in six different ways, as shown in Figure 14. In Figure 14(a) and (b) [...] In Figure 14(c), the min-cut of height 1 for node $s_i$ has a size of 4. Hence $MMD_{D(N(x_i))}(x_i) = MMD_{D(N(x_i))}(\bar{x}_i) = 3$. In Figure 14(d) and (e), the min-cut of height 1 for node $s_i$ has a size of 3. But due to the edge $(w_i^2, x_i)$, the min-cut of height 1 for node $x_i$ still has a size of $K + 1$. Hence $MMD_{D(N(x_i))}(x_i) = MMD_{D(N(x_i))}(\bar{x}_i) = 3$. In Figure 14 [...]
PROOF. Because every literal node has only one fanin, the mapping depth of a literal node is equal to the mapping depth of its variable node. In subnetwork $N(C_j)$, assume one literal node has a mapping depth equal to 2 (e.g., $l_j^3$ in Figure 6(b)). For each of the other two literal nodes of mapping depth 3, there exists a 2-feasible cut of height 2 (Proposition 2). Together with nodes $q_j^1, \ldots, q_j^{2K-5}$ in the fanins of node $C_j$, there will be $2K$ nodes of mapping depth 2. Hence two intermediate nodes of mapping depth 3 will be introduced during the decomposition of node $C_j$. Since node $C_j$ has $K - 2$ fanins of mapping depth 3 (i.e., $r_j^1, \ldots, r_j^{K-2}$), a K-feasible cut of height 3 exists for node $C_j$. Hence $MMD_{D(N(C_j))}(C_j) = 4$ (see Figure 6(b)). If $C_j$ contains more than one literal node of mapping depth 2, the result still holds. On the other hand, if all three literal nodes have mapping depth equal to 3, there will be at least $2K + 1$ nodes of mapping depth 2 in the fanin cone of node $C_j$. Three intermediate nodes of mapping depth 3 must be introduced during the decomposition of node $C_j$. A K-feasible cut of height 3 for node $C_j$ becomes impossible. So $MMD_{D(N(C_j))}(C_j) = 5$. This proves Proposition 3. $\square$
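The counting behind the two cases can be written out explicitly. The following is only a worked restatement of the argument above, under the assumption (implicit in the proof) that each intermediate node of mapping depth 3 can absorb at most K nodes of mapping depth 2. The symbols $n_2$ (number of depth-2 nodes to be packed) and $b$ (number of intermediate nodes) are introduced here for illustration and do not appear in the original proof.

```latex
\begin{align*}
\text{Case 1 (one literal node of depth 2, two of depth 3):}\quad
  n_2 &= 1 + 2 \cdot 2 + (2K - 5) = 2K, \\
  b &= \lceil n_2 / K \rceil = 2, \\
  (K - 2) + b &= K \le K
  \;\Rightarrow\; MMD_{D(N(C_j))}(C_j) = 4. \\[4pt]
\text{Case 2 (all three literal nodes of depth 3):}\quad
  n_2 &\ge 3 \cdot 2 + (2K - 5) = 2K + 1, \\
  b &\ge \lceil (2K + 1) / K \rceil = 3, \\
  (K - 2) + b &\ge K + 1 > K
  \;\Rightarrow\; MMD_{D(N(C_j))}(C_j) = 5.
\end{align*}
```

In Case 1 the $K - 2$ nodes $r_j^k$ and the two intermediate nodes together form a cut of exactly K nodes of depth 3, which is why the bound is tight.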
We show the correspondence between the truth assignment of variables in an instance F of 3SAT and the decomposition of $N(x_i)$, $1 \le i \le n$, as follows: variable $x_i = 1$ in 3SAT if and only if $MMD_{D(N(s_i))}(x_i) = 2$, and variable $x_i = 0$ if and only if $MMD_{D(N(s_i))}(\bar{x}_i) = 2$. We now prove Lemma 5 by this correspondence.
PROOF OF LEMMA 5. (If). It is obvious that $MMD(D(N(F))) = 4$ if and only if $MMD_{D(N(C_j))}(C_j) = 4$ for $1 \le j \le m$. This implies that at least one literal node (and its variable node) in each subnetwork $N(C_j)$ has a mapping depth of 2 (Proposition 3). According to the correspondence, we can obtain a truth assignment of variables from the literal nodes of mapping depth 2. Since each $N(C_j)$ contains at least one literal node of mapping depth 2, each clause $C_j$ contains a literal of value 1. Hence F is satisfied by the truth assignment.
(Only if). Assume F is satisfiable. Then there is a truth assignment of variables that allows at least one literal in each clause to have a value of 1. According to the correspondence, this truth assignment of variables specifies how each subnetwork $N(x_i)$ ($1 \le i \le n$) is decomposed and determines the mapping depths of nodes $x_i$ and $\bar{x}_i$ in each $N(x_i)$. Literal nodes have the same mapping depth as their variable nodes. Since each clause $C_j$ contains a literal of true value, each $N(C_j)$ contains a literal node of mapping depth 2. Hence $MMD(D(N(F))) = 4$. This proves Lemma 5. $\square$
PROOF 10. When the subnetwork $N_K(C_j)$ contains at least one literal node of mapping depth 2, together with nodes $q_j^1, \ldots, q_j^{K-5}$, a K-feasible cut of height 2 exists for node $C_j$. Hence node $C_j$ has a mapping depth equal to 3. Otherwise, node $C_j$ has a mapping depth of 4. $\square$
PROOF OF LEMMA 6. The proof is similar to that of Lemma 5 and is omitted. $\square$
