In this paper, we study the problem of decomposing gates in fanin-unbounded or K-bounded networks such that the K-input LUT mapping solutions computed by a depth-optimal mapper have minimum depth. We show that (1) any decomposition leads to a smaller or equal mapping depth regardless of the decomposition algorithm used, and (2) the problem is NP-hard for unbounded networks when K ≥ 3 and remains NP-hard for K-bounded networks when K ≥ 5. We propose a gate decomposition algorithm, named DOGMA, which combines the level-driven node packing technique (Chortle-d) and the network flow based optimal labeling technique (FlowMap). Experimental results show that networks decomposed by DOGMA allow depth-optimal technology mappers to improve the mapping solutions by up to 11% in depth and up to 35% in area compared to the mapping results of networks decomposed by other existing decomposition algorithms.
Introduction
Lookup-table (LUT) based FPGAs have been a popular technology for VLSI ASIC design and system prototyping. A K-input LUT (K-LUT) can implement any function of up to K variables. The goal of LUT-based FPGA technology mapping is to cover a given network using LUTs such that either area or delay is minimized or routability is maximized in the final LUT network. The delay of a network can be estimated by the number of levels (i.e. depth). Two factors affect the mapping solution depth: the gate decomposition before mapping and the mapping algorithm. Several LUT mapping algorithms have been proposed for depth minimization [6, 9, 2]. In particular, the FlowMap algorithm [2] guarantees a depth-optimal mapping solution for any K-bounded network. However, FlowMap cannot be applied directly to unbounded networks. Gate decomposition can be classified into structural decomposition and Boolean decomposition. Structural decomposition replaces multi-fanin (simple) gates by fanin trees, while Boolean decomposition decomposes the functionality of gates. This paper focuses on structural decomposition for depth minimization in LUT mapping.
Gate decomposition affects the mapping solution depth significantly. For example, assume K = 3. The network N in Figure 1(a) is not K-bounded. If node v is decomposed in the way shown in Figure 1(b), there is no way to obtain a mapping solution of depth less than 3. However, if the decomposition shown in Figure 1(c) is carried out for node v, a mapping solution of depth equal to 2 can be obtained. Even for K-bounded networks, the depth of mapping solutions computed by FlowMap may decrease if gates are further decomposed before mapping [4].
Several gate decomposition routines have been used for LUT mapping. The tech_decomp and speed_up routines in SIS [10] and the dmig routine in [1] focus on minimizing the number of levels in the decomposed network. They do not directly minimize the depth of the mapping solution. Chortle-d [6] computes depth-optimal gate decomposition and mapping solutions for tree networks (which may be unbounded) but produces suboptimal results for general networks. In this paper, we study the structural gate decomposition problem: decompose the gates in a general network such that a mapping solution of minimum depth can be obtained.
The remainder of this paper is organized as follows. Section 2 defines basic terminology, presents properties of structural gate decomposition and formulates the problem. Section 3 gives the complexity results. A novel gate decomposition algorithm for depth-optimal mapping is presented in Section 4. Experimental results are presented in Section 5 and Section 6 concludes the paper. 
Problem Formulation
Let K be the input size of an LUT. Let input(v) be the set of fanin nodes of node v. A primary input (PI) node has no fanins and a primary output (PO) node has no outgoing edges. A network N is K-bounded if every node v ∈ N satisfies |input(v)| ≤ K. Otherwise, it is an unbounded network. Given a subnetwork H, we use input(H) to denote the set of distinct nodes outside H which supply inputs to nodes in H.
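The definitions above can be illustrated with a minimal sketch. The dictionary representation of a network and the names `is_k_bounded` and `subnetwork_inputs` are assumptions made for illustration only; they are not part of our implementation.

```python
from typing import Dict, List, Set

# Hypothetical representation: each node maps to the list of its fanin nodes.
# PI nodes map to an empty list.
Network = Dict[str, List[str]]


def is_k_bounded(network: Network, k: int) -> bool:
    """Return True if every node v satisfies |input(v)| <= k."""
    return all(len(set(fanins)) <= k for fanins in network.values())


def subnetwork_inputs(network: Network, h: Set[str]) -> Set[str]:
    """Return input(H): distinct nodes outside H that feed nodes in H."""
    return {u for v in h for u in network[v] if u not in h}


# Small example: y has three fanins, so the network is 3-bounded
# but not 2-bounded.
example: Network = {
    "a": [], "b": [], "c": [], "d": [],
    "x": ["a", "b"],
    "y": ["x", "c", "d"],
}
```

For the subnetwork H = {x, y} in this example, input(H) consists of the four PI nodes, since x lies inside H and is therefore not counted.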
, is the maximum mapping depth for nodes in n (X v ,X v ). We have the following lemma based on results in [2].
Lemma 2
Given a network N, any decomposition algorithm D, and any node v ∈ N, it must be true that
According to Lemma 2, the further a network is decomposed, the smaller the mapping depth might be. Therefore, we decompose every gate into a binary fanin tree. Figure 2(c) is a complete decomposition of node v. Each decomposition step introduces one intermediate node, and |input(v)| − 2 steps are required to decompose v completely. We formulate the following problems.
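The step count can be checked with a short sketch. It builds one particular binary fanin tree (a left-leaning chain) for a simple gate; the function name `decompose_to_binary` and the naming of intermediate nodes are hypothetical. The chain is only one of many possible tree shapes, and choosing the shape that minimizes the mapping depth is exactly the problem formulated below.

```python
from typing import Dict, List, Tuple


def decompose_to_binary(gate: str, fanins: List[str]) -> Tuple[Dict[str, List[str]], int]:
    """Decompose one simple gate into 2-input nodes, one step at a time.

    Returns the new nodes (name -> fanins) and the number of steps taken.
    Assumes the gate function is associative (e.g. AND, OR), as for simple gates.
    """
    nodes: Dict[str, List[str]] = {}
    work = list(fanins)
    steps = 0
    while len(work) > 2:
        # One decomposition step: merge two fanins into a new intermediate node.
        left, right = work.pop(0), work.pop(0)
        name = f"{gate}_t{steps}"
        nodes[name] = [left, right]
        work.insert(0, name)
        steps += 1
    nodes[gate] = work  # the original gate keeps at most two fanins
    return nodes, steps


tree, steps = decompose_to_binary("v", ["a", "b", "c", "d", "e"])
```

With five fanins the loop runs three times, which matches |input(v)| − 2, and every resulting node has at most two fanins.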
Structural Gate Decomposition for K-LUT Mapping (SGD/K)
Given a simple-gate unbounded network $N_\infty$, decompose $N_\infty$ into a 2-input network $N_2$ such that for any other 2-input network decomposition $N'_2$ of $N_\infty$, MMD($N_2$) ≤ MMD($N'_2$).
Structural Gate Decomposition for K-LUT Mapping of
There are two issues related to the problem of gate decomposition. (1) A smaller depth might be obtained when several gates are decomposed simultaneously instead of independently [4]. This is because intermediate nodes can be shared during the decomposition of multiple gates of the same functional type. (2) Gate decomposition can be performed before the mapping phase in a two-step approach, or embedded into the mapping process as part of an integrated approach. For example, Chortle-d [6] is an integrated approach (since it interleaves gate decomposition and LUT mapping), while dmig + FlowMap in [2] is a two-step approach. We can show that the best two-step approach produces the same optimal mapping depth as the best integrated approach [4]. In this paper, we consider only independent gate decompositions in a two-step approach. Nevertheless, our gate decomposition algorithm takes into account the impact of gate decomposition on mapping, in order to obtain a decomposed network that is most suitable for FlowMap to achieve a minimum mapping depth.
Complexity of SGD/K and K-SGD/K Problems
In this section, we only state our complexity theorems. Complete proofs can be found in [4].
Theorem 1
The SGD/K problem is NP-hard for K ≥ 3.
Theorem 2
The K-SGD/K problem is NP-hard for K ≥ 5.
Gate Decomposition Algorithm for Depth-Optimal LUT Mapping
Our decomposition algorithm, named DOGMA (Depth-Optimal Gate decomposition for MApping), combines the level-driven node packing technique in Chortle-d and the network flow based labeling technique in FlowMap. Given a network N, DOGMA decomposes nodes from PIs to POs in topological order. Let N(v) denote the network after decomposing the node v. DOGMA labels each node v by $MMD_{N(v)}(v)$ as follows. Nodes in input(v) are grouped so that each group consists of nodes with the same label (i.e. minimum mapping depth). Groups are processed in ascending order of their labels. For a group of nodes labeled p, the nodes are packed into a minimum number of bins such that a K-feasible cut of height p − 1 exists for the nodes in each bin (checked based on Lemma 1). Such a bin is called a min-height K-feasible bin. A node $u_i$ is created for each bin $B_i$, with fanins from the nodes in $B_i$ and a fanout to v, and $u_i$ is given the label p. Note that, according to Lemma 3, the minimum mapping depth of the network is the same whether or not $u_i$ is further decomposed. We then proceed to the group with the next higher label p + 1. For each node $u_i$ created in the previous step, a buffer node $w_i$ (with label p + 1) is created with $u_i$ as its input. (All buffer nodes are removed after decomposition.) These buffer nodes, together with the nodes in the group of label p + 1, are again packed into a minimal number of min-height K-feasible bins. We continue this process until all nodes are packed into one bin, which corresponds to the node v.
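The level-by-level packing loop for a single node v can be sketched as follows. This is a simplified illustration under stated assumptions, not our implementation: each fanin is described by a hypothetical pair (label, cut size) with label ≥ 1 and cut size ≤ K, and a bin is accepted when the sum of its cut sizes is at most K. The sum is only an upper bound on the true cut size (it is exact when no cut nodes are shared, as in trees), whereas DOGMA checks feasibility with a network flow computation. Bins are left undecomposed, a buffer for $u_i$ is represented by carrying $u_i$ itself with cut size 1, and the names `decompose_node` and `pack_ffd` are hypothetical.

```python
from typing import Dict, List, Tuple

Item = Tuple[str, int]  # (node name, size of its min-height cut); an assumption


def pack_ffd(items: List[Item], k: int) -> List[List[Item]]:
    """First-fit-decreasing packing; a bin is feasible if its sizes sum to <= k."""
    bins: List[List[Item]] = []
    for item in sorted(items, key=lambda it: -it[1]):
        for b in bins:
            if sum(size for _, size in b) + item[1] <= k:
                b.append(item)
                break
        else:
            bins.append([item])
    return bins


def decompose_node(v: str, fanins: Dict[str, Tuple[int, int]], k: int):
    """fanins maps name -> (label, cut_size). Returns (new nodes, level of v)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    groups: Dict[int, List[Item]] = {}
    for name, (label, cut) in fanins.items():
        groups.setdefault(label, []).append((name, cut))
    new_nodes: Dict[str, List[str]] = {}
    carried: List[Item] = []  # stand-ins for the buffer nodes w_i
    counter = 0
    p = min(groups)
    while True:
        bins = pack_ffd(carried + groups.pop(p, []), k)
        if not groups and len(bins) == 1:
            # All nodes fit into one bin, which corresponds to v itself.
            new_nodes[v] = [name for name, _ in bins[0]]
            return new_nodes, p
        carried = []
        for b in bins:
            if len(b) == 1:
                # A singleton bin needs no new node; at level p + 1 the node
                # itself forms a cut of size 1.
                carried.append((b[0][0], 1))
            else:
                u = f"{v}_u{counter}"
                counter += 1
                new_nodes[u] = [name for name, _ in b]
                carried.append((u, 1))
        p += 1


nodes, level = decompose_node(
    "v", {"a": (1, 2), "b": (1, 1), "c": (1, 2), "d": (1, 1), "e": (2, 2)}, 3
)
```

In this example the four fanins of label 1 are packed into two bins, one of which is packed together with e at label 2, and the final bin is formed at level 3. Because the sum-based test is conservative, the level returned by the sketch is an upper bound on the label DOGMA would compute.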
DOGMA is similar to Chortle-d in that decomposition is done by packing nodes into a minimal number of min-height K-feasible bins. However, Chortle-d integrates gate decomposition with technology mapping, and computes mapping depth based on the partially generated LUT network. Since the fanin constraint is not a monotone clustering constraint [2], Chortle-d may obtain inaccurate node mapping depths. Moreover, Chortle-d enumerates all packing combinations for nodes on reconvergent paths, which is quite expensive. In contrast, DOGMA both computes mapping depth and packs nodes into bins (as discussed below) using the network flow based computation. The mapping depth is therefore always accurate, and reconvergent paths are taken into account naturally.
The problem that remains to be solved is the min-height K-feasible bin packing problem, defined as follows.
Min-Height K-Feasible Bin Packing Problem
Given a set $S_p$ ⊆ input(v) of nodes of minimum mapping depth p when decomposing node v, partition $S_p$ into a minimal number of bins such that there is a K-feasible cut of height p − 1 for the nodes in each bin.
We give three heuristics and one exact method to solve the problem. Our heuristics are based on the max-flow algorithm and bin-packing heuristics. We define the total cut size $TC_p(S)$ of a set S of nodes with label p to be the size of the min-cut of height p − 1 which separates S from all PIs. A set X of nodes of label p can be packed into one bin as long as $TC_p(X)$ ≤ K. Nodes are packed in decreasing order of their individual cut sizes. The first two heuristics, MC-FFD and MC-BFD, are based on first-fit-decreasing (FFD) and best-fit-decreasing (BFD), two heuristics for the bin packing problem [8]. The third is the maximal-share-decreasing (MC-MSD) heuristic, which packs together nodes that can maximally share a cut. The fourth method is inspired by the dynamic programming approach for the number partitioning problem [7]. Instead of partitioning numbers, we ask whether there is a way to partition the nodes in $S_p$ into k subsets
We can solve the problem by dynamic programming. By searching for the minimal k (k = 2, 3, ...), the min-height K-feasible bin packing problem can be solved optimally. We refer to this method as the MC-DP algorithm. The details of these algorithms can be found in [4].
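The first-fit and best-fit variants can be sketched as follows. This is an illustration under stated assumptions rather than the algorithms of [4]: each node is given a hypothetical precomputed cut (a set of node names), and the total cut size of a bin is approximated by the size of the union of these cuts. The union is itself a cut for the whole bin, so it is an upper bound on $TC_p$, whereas the actual heuristics compute the min-cut by max-flow. The names `total_cut_size` and `pack` are hypothetical.

```python
from typing import Dict, FrozenSet, List

CutSets = Dict[str, FrozenSet[str]]  # node -> its individual cut (assumed given)


def total_cut_size(nodes: List[str], cuts: CutSets) -> int:
    """Stand-in for TC_p: size of the union of the nodes' individual cuts."""
    merged = set()
    for n in nodes:
        merged |= cuts[n]
    return len(merged)


def pack(nodes: List[str], cuts: CutSets, k: int, best_fit: bool = False) -> List[List[str]]:
    """Pack nodes in decreasing order of individual cut size.

    First fit puts a node into the first feasible bin; best fit puts it into
    the feasible bin that is left with the least spare capacity.
    """
    bins: List[List[str]] = []
    for n in sorted(nodes, key=lambda x: -len(cuts[x])):
        feasible = [b for b in bins if total_cut_size(b + [n], cuts) <= k]
        if not feasible:
            bins.append([n])
        elif best_fit:
            max(feasible, key=lambda b: total_cut_size(b + [n], cuts)).append(n)
        else:
            feasible[0].append(n)
    return bins


cuts: CutSets = {
    "n1": frozenset({"a", "b", "c"}),
    "n2": frozenset({"c", "d"}),
    "n3": frozenset({"a", "b"}),
    "n4": frozenset({"e"}),
}
ffd_bins = pack(["n1", "n2", "n3", "n4"], cuts, 4)
bfd_bins = pack(["n1", "n2", "n3", "n4"], cuts, 4, best_fit=True)
```

In this example n1, n2, and n3 fit into one bin for K = 4 because their cuts overlap, although their individual cut sizes sum to 7. Shared cut nodes are what distinguish this problem from ordinary bin packing, and they motivate the MC-MSD heuristic.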
Experimental Results
We have implemented the DOGMA algorithm with MC-FFD, MC-BFD, MC-MSD, and MC-DP packing methods using the C language and incorporated our implementation into the RASP FPGA synthesis system [5] .
In our experiments, we optimize the MCNC benchmark circuits for area using standard SIS scripts, decompose them into simple gate networks, apply gate decomposition routines to obtain 2-input networks, and obtain the final LUT networks using a depth-optimal technology mapper. We choose K = 5 in the experiments.
We compare the performance of our four methods for the min-height K-feasible bin packing in DOGMA and observe that the impact of the four methods on mapping results is almost the same. Since MC-FFD is faster than the other three methods, DOGMA employs MC-FFD to solve the packing problem. We compare DOGMA with two decomposition routines: tech_decomp in SIS [10] and dmig in [1]. The tech_decomp routine is based on a balanced-tree heuristic which only minimizes the gate level locally. The dmig routine minimizes the gate level of the decomposed networks. The networks decomposed by the three algorithms are all mapped by CutMap [3], an enhancement of FlowMap. The results are shown in Table 1. We see that CutMap produces the same or smaller depth for circuits decomposed by DOGMA. On average, DOGMA allows CutMap to achieve 10% and 4% depth reduction compared to tech_decomp and dmig, respectively.
We compare DOGMA + CutMap with existing gate decomposition and depth-oriented mapping algorithms. The tested circuits are area-optimized MCNC benchmarks. The mappers TechMap-D [9] , FlowMap [2] , and CutMap [3] are used for comparison. The dmig was used in [2] while the speed_up was used in [9] and [3] to prepare 2-input networks for technology mapping. The results are in Table 2 . Comparing results from [2, 3] 
Conclusion
In this paper, we study structural gate decomposition for depth-optimal LUT mapping. We show that gate decomposition leads to a smaller or equal mapping depth regardless of the decomposition algorithm used, and that the problem is NP-hard for unbounded networks when K ≥ 3 and remains NP-hard for K-bounded networks when K ≥ 5. We propose a gate decomposition algorithm (DOGMA) for depth-optimal mapping. Experimental results show a reduction of up to 10% in mapping depth. Together with CutMap, we achieve comparable mapping depth with up to 35% reduction in area compared to previous results.
