Abstract-In order to maximize performance and device utilization, recent generations of field programmable gate arrays (FPGAs) take advantage of the speed and density benefits of heterogeneous architectures, which can be classified into heterogeneous FPGAs without bounded resources and heterogeneous FPGAs with bounded resources. In this paper, we study the technology mapping problem for heterogeneous FPGAs with or without bounded resources under the objective of delay optimization. We present the first polynomial-time delay optimal technology mapping algorithm, named HeteroMap, for heterogeneous FPGAs without bounded resources. Taking different delays of heterogeneous lookup tables (LUTs) into consideration, the HeteroMap algorithm computes the minimum mapping delay of a circuit based on a series of minimum-height $K$-feasible cut computations at each node in the circuit. We then study the technology mapping problem for delay minimization for heterogeneous FPGAs with bounded resources. We show that this problem is NP-hard for general networks, in contrast to the delay minimization mapping problem for heterogeneous FPGAs without bounded resources, but can be solved optimally in pseudopolynomial time for trees. We then present two heuristic algorithms to solve this problem for general networks. We have applied these algorithms to MCNC benchmarks targeting commercial FPGAs and report encouraging results on delay and area reduction.
I. INTRODUCTION
THE short design cycle and low nonrecurring engineering (NRE) cost have made field programmable gate arrays (FPGAs) an important technology in very large scale integration (VLSI) designs. In a traditional lookup table (LUT)-based FPGA device, the basic programmable logic block (PLB) is a $K$-input LUT ($K$-LUT) which can implement any Boolean function of up to $K$ variables. In order to maximize performance and device utilization, recent generations of FPGAs take advantage of speed and density benefits resulting from heterogeneous FPGAs, which provide either an array of homogeneous PLBs, each configured to implement circuits with LUTs of different sizes, or an array of physically heterogeneous LUTs. For example, the PLBs in Xilinx XC4000 series FPGAs [20], Lucent ORCA2C series FPGAs [16], and the recently announced Vantis VF1 FPGAs [1] can be configured to have heterogeneous LUTs. These heterogeneous FPGAs do not have limitations on the availability of LUTs of specific configurable sizes (as long as the chip capacity is not exceeded) due to their PLB configuration flexibility. On the other hand, Altera FLEX10K devices [2] (see Fig. 1) provide both a logic array of normal $K$-LUTs and an embedded memory array with a series of embedded memory blocks (EMBs) which, if not used as on-chip memories, can be used to implement logic functions. These heterogeneous FPGAs have limitations on one or several types of LUTs. For example, in one FLEX10K device chip, there are 3-12 EMBs according to the device size, and each EMB can be used as an 11-LUT. Therefore, heterogeneous FPGAs can be classified as heterogeneous FPGAs without bounded resources or heterogeneous FPGAs with bounded resources. In a heterogeneous FPGA, larger LUTs can cover more gates, but usually have longer delays.
Given a heterogeneous FPGA with or without bounded resources, an important problem is how to utilize the available heterogeneous LUTs to minimize the overall circuit delay and/or area during technology mapping. For example, assuming each 4-LUT has a delay of 1.0 and each 5-LUT has a delay of 1.5, the circuit depicted in Fig. 2(a) can be mapped into five homogeneous 4-LUTs, with a total mapping delay of 3.0 [see Fig. 2(b)]. In Fig. 2(c), the same circuit is mapped into two 4-LUTs and one 5-LUT, with a total mapping delay of 2.5. In general, given a heterogeneous FPGA with LUTs of different sizes, we want to find an optimal mapping solution with the minimum delay or area. In this paper, we study the technology mapping problem for heterogeneous FPGAs with or without bounded resources under the objective of delay optimization.
In the past few years, extensive studies have been done on technology mapping for LUT-based FPGAs. A survey of these results can be found in [6] . However, most of these algorithms are not able to deal with the delay optimization problem for heterogeneous FPGAs, as they assume the identical capacity and delay for every LUT. In [14] , a heuristic approach for technology mapping into heterogeneous LUT-based FPGAs was presented for area minimization, and their architecture assumes a mixture of only two types of independent LUTs with a fixed ratio in each target PLB. The recent work in [15] shows that the same area minimization mapping problem for a tree network can be solved optimally in time. However, the optimality holds only for trees, which significantly limits the application of this algorithm. In [9] and [19] , it was shown that uncommitted EMBs in heterogeneous FPGAs can be efficiently used to implement logic for area minimization. The algorithm in [9] can further guarantee that the circuit delay will not increase while using available EMBs for area reduction.
In this paper, we present the first polynomial-time delay optimal technology mapping algorithm, named HeteroMap, for heterogeneous FPGAs without bounded resources. Taking different delays of heterogeneous LUTs into consideration, the HeteroMap algorithm computes the minimum mapping delay of a circuit based on a series of minimum-height $K$-feasible cut$^3$ computations at each node in the circuit. For a heterogeneous FPGA consisting of $K_1$-LUTs, $K_2$-LUTs, $\ldots$, and $K_c$-LUTs, HeteroMap computes the minimum delay mapping solution in $O(\sum_{i=1}^{c} K_i \cdot n \cdot m \cdot \log n)$ time for a circuit netlist with $n$ gates and $m$ edges. HeteroMap also effectively minimizes the area of the mapping solution by maximizing the volume of each cut and by the post-mapping packing operations. We then study the technology mapping problem for delay minimization for heterogeneous FPGAs with bounded resources, where the HeteroMap algorithm cannot be applied directly, since HeteroMap cannot constrain the use of certain types of LUTs. We show that this problem is NP-hard for general networks, in contrast to the delay minimization mapping problem for heterogeneous FPGAs without bounded resources, but can be solved optimally in pseudopolynomial time for trees. We then present two heuristic algorithms, named BinaryHM and CN-HM, to solve this problem for general networks. HeteroMap, BinaryHM, and CN-HM produce favorable results on MCNC benchmarks on commercial heterogeneous FPGAs.
$^3$ A minimum-height $K$-feasible cut is a $K$-feasible cut with the minimum height. The concepts of $K$-feasible cut and the height of a cut are to be defined formally in Section II.
The remainder of this paper is organized as follows. Section II presents the problem formulation and preliminaries. Section III describes our delay optimal technology mapping algorithm for heterogeneous FPGAs without bounded resources. Section IV studies the delay minimization technology mapping problem for heterogeneous FPGAs with bounded resources. Experimental results and comparative study are presented in Section V. Section VI concludes the paper. Preliminary results in this paper were presented as extended abstracts in [10] and [11] .
II. PROBLEM FORMULATION AND PRELIMINARIES
A Boolean network $N$ can be represented as a directed acyclic graph (DAG) where each node represents a logic gate,$^4$ and a directed edge $(i, j)$ exists if the output of gate $i$ is an input of gate $j$. Primary input (PI) nodes have no incoming edge, and primary output (PO) nodes have no outgoing edge. We use $input(v)$ to denote the set of fanins of gate $v$. Given a subgraph $H$ of the Boolean network, let $input(H)$ denote the set of distinct nodes outside $H$ which supply inputs to the gates in $H$. For a node $v$ in the network, a $K$-feasible cone at $v$, denoted $C_v$, is a subgraph consisting of $v$ and its predecessors,$^5$ such that $|input(C_v)| \le K$ and any path connecting a node in $C_v$ and $v$ lies entirely in $C_v$. The level of a node $v$ is the length of the longest path from any PI node to $v$. The level of a PI node is zero. The depth of a network is the largest node level in the network. A Boolean network is $K$-bounded if $|input(v)| \le K$ for each node $v$ in the network.
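The level, depth, and boundedness definitions above can be checked on a small example. The following is a minimal sketch, assuming the network is stored as a dictionary from each node to its fanin list (PIs have an empty list); the function names and the three-gate network are hypothetical and chosen only for illustration.

```python
def node_levels(fanins):
    """Longest-path level of every node; PIs (empty fanin list) are at level 0."""
    levels = {}

    def level(v):
        if v not in levels:
            ins = fanins[v]
            levels[v] = 0 if not ins else 1 + max(level(u) for u in ins)
        return levels[v]

    for v in fanins:
        level(v)
    return levels


def is_k_bounded(fanins, k):
    """True if every gate has at most k fanins."""
    return all(len(ins) <= k for ins in fanins.values())


# Hypothetical network: PIs a, b, c and gates g1, g2, g3.
net = {"a": [], "b": [], "c": [],
       "g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g1", "g2"]}
levels = node_levels(net)
depth = max(levels.values())
print(levels, depth, is_k_bounded(net, 2))
```

The recursion is adequate for small illustrative networks; a large netlist would call for an iterative topological traversal instead.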
We assume that a general LUT-based heterogeneous FPGA consists of $c$ types of LUTs of $K_1$-, $K_2$-, $\ldots$, and $K_c$-LUT ($K_1 < K_2 < \cdots < K_c$), with delays $d_1, d_2, \ldots$, and $d_c$ ($d_1 \le d_2 \le \cdots \le d_c$, and $d_i$ may not be an integer), and with resource bounds $R_1, R_2, \ldots$, and $R_c$, one for each type of LUT, respectively. Without loss of generality, $d_1$ is scaled to one in the remaining discussions.
For heterogeneous FPGAs without bounded resources, all of the $R_i$'s are $\infty$. Homogeneous FPGAs can be viewed as the special heterogeneous FPGA with only one type of LUT.
For a circuit mapped into a heterogeneous FPGA, we assume different access delays for heterogeneous LUTs but a constant delay for the interconnection,$^6$ which is called the heterogeneous LUT-delay model. The unit-delay model used in [4] for homogeneous FPGAs is a special case of the heterogeneous LUT-delay model.
$^4$ In the rest of the paper, gate and node are used interchangeably for Boolean networks.
$^5$ $u$ is a predecessor of $v$ if there is a directed path from $u$ to $v$.
$^6$ The constant interconnection delay can be counted into the LUT delays such that the interconnection delay can be set to zero. In general, interconnection delays are highly dependent on the placement result. We choose to consider interconnection delay as constant as there is no good delay model available so far which is able to accurately estimate interconnection delay in the logic synthesis phase. See Section VI for more discussions.
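Under the heterogeneous LUT-delay model, the delay of a mapped circuit is the longest PI-to-output path, where each LUT contributes the delay of its type and interconnect contributes nothing. The sketch below illustrates this with the 4-LUT and 5-LUT delays of 1.0 and 1.5 quoted in the introduction. The two LUT networks are hypothetical and are not the topology of Fig. 2; they are only chosen to reproduce the totals 3.0 and 2.5.

```python
def mapping_delay(luts, lut_delay):
    """luts: name -> (LUT size, fanin names); names absent from luts are PIs.
    lut_delay: LUT size -> access delay. Interconnect delay is taken as zero."""
    arrival = {}

    def arr(v):
        if v not in luts:                      # primary input
            return 0.0
        if v not in arrival:
            size, ins = luts[v]
            arrival[v] = lut_delay[size] + max((arr(u) for u in ins), default=0.0)
        return arrival[v]

    return max(arr(v) for v in luts)


delays = {4: 1.0, 5: 1.5}

# Hypothetical homogeneous solution: three levels of 4-LUTs.
homogeneous = {"A": (4, ["x1", "x2", "x3", "x4"]),
               "B": (4, ["A", "x5", "x6"]),
               "C": (4, ["B", "x7"])}

# Hypothetical heterogeneous solution: one 4-LUT feeding one 5-LUT.
heterogeneous = {"A": (4, ["x1", "x2", "x3", "x4"]),
                 "C": (5, ["A", "x5", "x6", "x7", "x8"])}

print(mapping_delay(homogeneous, delays))    # 3.0
print(mapping_delay(heterogeneous, delays))  # 2.5
```

Setting every entry of `lut_delay` to 1.0 recovers the unit-delay model as a special case.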
Given these definitions, the technology mapping problem for heterogeneous FPGAs can be formulated as follows: Given a $K_1$-bounded Boolean network $N$ and a heterogeneous FPGA with or without bounded resources, transform $N$ into an equivalent LUT network by making use of the available heterogeneous LUT resources such that the circuit delay and/or area are minimized. The corresponding technology mapping problems for delay minimization for heterogeneous FPGAs without and with bounded resources are abbreviated as problem DM-HF-UR and problem DM-HF-BR, respectively.
In this paper, our primary objective is to minimize the circuit mapping delay under the heterogeneous LUT-delay model through technology mapping. Therefore, a mapping solution is said to be optimal if the mapping delay is minimized. The secondary objective is to reduce the area used in the technology mapping solution. Fig. 2(b) shows an example of mapping a combinational circuit into uniform 4-LUTs, which is done by FlowMap [4] , while Fig. 2(c) illustrates how the circuit is to be mapped into a heterogeneous FPGA with both 4-LUTs and 5-LUTs, which can be achieved by the HeteroMap algorithm presented in Section III.
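Given a network $N = (V(N), E(N))$ with a source $s$ and a sink $t$, a cut $(X, \overline{X})$ is a partition of the nodes in $V(N)$ such that $s \in X$ and $t \in \overline{X}$. The node cut set of $(X, \overline{X})$, denoted $V(X, \overline{X})$, is the set of nodes in $X$ that are adjacent to some node in $\overline{X}$, i.e., $V(X, \overline{X}) = \{x : (x, y) \in E(N),\ x \in X \text{ and } y \in \overline{X}\}$. The node cut-size of $(X, \overline{X})$, denoted $n(X, \overline{X})$, is the number of nodes in $V(X, \overline{X})$. A cut is $K$-feasible if its node cut-size is no more than $K$, i.e., $n(X, \overline{X}) \le K$. Assuming that each edge $(u, v)$ has a nonnegative capacity $c(u, v)$, the edge cut-size of $(X, \overline{X})$, denoted $e(X, \overline{X})$, is the sum of the capacities of the edges that go from $X$ to $\overline{X}$, i.e., $e(X, \overline{X}) = \sum_{u \in X,\, v \in \overline{X}} c(u, v)$. Moreover, assuming that there is a given label$^7$ $l(v)$ associated with each node $v$, the height of a cut $(X, \overline{X})$, denoted $h(X, \overline{X})$, is defined to be the maximum label of the nodes in $X$, i.e., $h(X, \overline{X}) = \max\{l(x) : x \in X\}$. The volume of a cut $(X, \overline{X})$, denoted $vol(X, \overline{X})$, is the number of nodes in $\overline{X}$, i.e., $vol(X, \overline{X}) = |\overline{X}|$. Assuming that each edge has a capacity of one, Fig. 3 shows a cut in a network with given node labels. The darkened nodes are in the node cut set.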
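The cut quantities defined above can be computed directly from an edge list. The following is a small sketch, assuming unit edge capacities and taking the height as the maximum label on the source side $X$; the network, its labels, and the function name are hypothetical and are not those of Fig. 3.

```python
def cut_metrics(edges, x_bar, label):
    """edges: list of (u, v) pairs; x_bar: nodes on the sink side of the cut;
    label: given label of each node on the source side."""
    nodes = {n for edge in edges for n in edge}
    x = nodes - x_bar
    crossing = [(u, v) for (u, v) in edges if u in x and v in x_bar]
    cut_set = {u for (u, _) in crossing}
    return {
        "node_cut_set": cut_set,
        "node_cut_size": len(cut_set),
        "edge_cut_size": len(crossing),        # every edge has capacity one
        "height": max(label[u] for u in x),    # maximum label on the X side
        "volume": len(x_bar),
    }


# Hypothetical network: PIs a, b, c; gates g1, g2; sink t.
edges = [("a", "g1"), ("b", "g1"), ("g1", "g2"),
         ("c", "g2"), ("g1", "t"), ("g2", "t")]
label = {"a": 0, "b": 0, "c": 0, "g1": 1, "g2": 2}
m = cut_metrics(edges, {"g2", "t"}, label)
print(m)
```

Here the cut is 2-feasible even though three edges cross it, because two of those edges leave the same node `g1`; this is why the mapping problem is stated in terms of node cut-size rather than edge cut-size.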
III. DELAY OPTIMAL TECHNOLOGY MAPPING FOR HETEROGENEOUS FPGAs WITHOUT BOUNDED RESOURCES
In this section, we present a delay optimal technology mapping algorithm, named HeteroMap, for heterogeneous FPGAs without bounded resources under the heterogeneous LUT-delay model. It is applicable to any $K_1$-bounded Boolean network. Given a general Boolean network as input, if it is not $K_1$-bounded, we first transform it into a two-input simple gate network by using any of the decomposition algorithms, such as tech_decomp in SIS [17], DMIG [3], and DOGMA [12]. The optimality of our algorithm holds not only for two-input simple gate networks, but for any $K_1$-bounded general Boolean network as well. The HeteroMap algorithm has two phases. In the first phase (Section III-A), following the topological order from PI to PO, HeteroMap uses the dynamic programming technique to compute the label for each node, which is the delay of the node if implemented by a LUT in a delay-optimal mapping solution. In the second phase (Section III-B), following the reverse topological order starting from POs, HeteroMap generates the heterogeneous LUT mapping solution based on the node labels and cuts computed in the first phase. Our algorithm also minimizes the circuit area by maximizing the volume of each cut and by the post-mapping packing operations (Section III-C). The details are discussed in the following subsections.
$^7$ $l(v)$ will be used later on to denote the minimum mapping delay at node $v$. The computation procedure for $l(v)$ will be described in Section III. Note that $l(v)$ may not be an integer.
A. Labeling Phase
Given a $K_1$-bounded Boolean network $N$, let $N_t$ denote the subnetwork consisting of node $t$ and all the predecessors of $t$. The label of $t$, denoted $l(t)$, is the delay of $t$ in the optimal heterogeneous LUT mapping solution of $N_t$. The first phase of our algorithm computes the labels for all the nodes in $N$, according to the topological order starting from the PIs. The topological order guarantees that every node is processed after all of its predecessors have been processed. For each PI node $u$, we assign $l(u) = 0$. Suppose that $t$ is the current node being processed. Then, for each node $u \ne t$ in $N_t$, the label $l(u)$ must have been computed. By including in $N_t$ an auxiliary node $s$ and connecting $s$ to all the PI nodes in $N_t$, we obtain a network with $s$ as the source and $t$ as the sink. For simplicity, we still denote it as $N_t$. Fig. 4(a) shows a Boolean network in which gate $t$ is to be labeled. Fig. 4(b) shows the construction of the network $N_t$. Assuming that $LUT(t)$ is the $K$-LUT that implements node $t$ in an optimal mapping solution of $N_t$ under the heterogeneous LUT-delay model, $K$ must be $K_1, K_2, \ldots$, or $K_c$. Suppose that we can compute $l_i(t)$ for each $i$, $1 \le i \le c$, which is the minimum delay of node $t$ in the optimal mapping solution of $N_t$ if $t$ is implemented by a $K_i$-LUT. Then, label $l(t)$ is the minimum one among all $l_i(t)$. That is
$$l(t) = \min_{1 \le i \le c} l_i(t). \qquad (1)$$
In order to compute $l_i(t)$ for each $i$, $1 \le i \le c$, let $LUT_i(t)$ be the $K_i$-LUT that implements node $t$. Let $\overline{X}$ denote the set of nodes in $LUT_i(t)$ and $X$ denote the remaining nodes in $N_t$. Then, $(X, \overline{X})$ forms a $K_i$-feasible cut between $s$ and $t$ in $N_t$ because the number of inputs of $LUT_i(t)$ is no more than $K_i$. Moreover, $l_i(t) = h(X, \overline{X}) + d_i$. Therefore, in order to minimize $l_i(t)$, we want to find a minimum-height $K_i$-feasible cut in $N_t$. Equation (1) can be rewritten as
$$l(t) = \min_{1 \le i \le c} \Big\{ \min_{(X, \overline{X}) \text{ is } K_i\text{-feasible}} h(X, \overline{X}) + d_i \Big\}. \qquad (2)$$
Based on the above discussion, we have the following:
Lemma 1: The label $l(t)$ computed by (1) gives the minimum delay of any mapping solution of $N_t$ under the heterogeneous LUT-delay model.
In Fig. 4, we assume that there are two types of LUTs, 4-LUTs and 5-LUTs, with delays of 1.0 and 1.5, in a heterogeneous FPGA device. Fig. 4(b) shows a minimum-height 4-feasible cut, which leads to $l_1(t) = 3$. Fig. 4(c) shows a minimum-height 5-feasible cut, which leads to $l_2(t) = 2.5$. Then, $l(t) = 2.5$, and the LUT that implements $t$ in the optimal mapping solution is a 5-LUT as shown in Fig. 4(d).
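The arithmetic of this example can be written out as a short check. The sketch below assumes that the minimum cut heights are 2 for the 4-feasible cut and 1 for the 5-feasible cut; these two values are not stated explicitly in the text but are implied by the quoted delays (3 - 1.0 and 2.5 - 1.5). The function name is hypothetical.

```python
def label_from_heights(min_height, lut_delay):
    """Candidate delay per LUT size is (minimum cut height + LUT delay);
    the label is the smallest candidate, and its LUT size is the chosen type."""
    candidates = {k: min_height[k] + lut_delay[k] for k in lut_delay}
    best_k = min(candidates, key=candidates.get)
    return candidates[best_k], best_k, candidates


label, lut_size, candidates = label_from_heights(
    min_height={4: 2.0, 5: 1.0},   # assumed heights implied by the example
    lut_delay={4: 1.0, 5: 1.5},
)
print(label, lut_size, candidates)  # 2.5 5 {4: 3.0, 5: 2.5}
```

The larger LUT wins here because its lower cut height outweighs its longer access delay; with a 5-LUT delay of 2.0 or more, the 4-LUT would be at least as good.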
Lemma 2 (Monotone Property): Let $l(t)$ be the label of node $t$ and $l(u)$ be the label of a predecessor $u$ of $t$, then $l(u) \le l(t)$.
Proof: Suppose that the type of LUT that node $t$ uses in the optimal mapping solution of $N_t$ is a $K_i$-LUT. Then the cut $(X, \overline{X})$ formed by this $K_i$-LUT in $N_t$ is a minimum-height $K_i$-feasible cut in $N_t$, and $l(t) = h(X, \overline{X}) + d_i$. Since node $u$ is a predecessor of node $t$, $(X, \overline{X})$ also determines a $K_i$-feasible cut $(X', \overline{X'})$ in $N_u$ with $h(X', \overline{X'}) \le h(X, \overline{X})$, where $X' = X \cap N_u$ and $\overline{X'} = \overline{X} \cap N_u$. Moreover, according to (2), $l(u) \le h(X', \overline{X'}) + d_i$. Therefore, $l(u) \le l(t)$.
According to (1) and (2), the key problem in computing $l(t)$ for node $t$ is to compute the minimum-height $K_i$-feasible cut in $N_t$ to get $l_i(t)$ for each $i$, $1 \le i \le c$. The computation of a minimum-height $K_i$-feasible cut in the case of the heterogeneous LUT-delay model is more complicated than that under the unit-delay model, although $l(\cdot)$ is monotone. First of all, we need to determine the range of the minimum height of a $K_i$-feasible cut in $N_t$, denoted $h_i(t)$.
Lemma 3: Let $t$ be a non-PI node in network $N$, then
$$\max\{l(u) : u \in input(t)\} - d_i \le h_i(t) \le \max\{l(u) : u \in input(t)\}. \qquad (3)$$
Proof: Since $(N_t - \{t\}, \{t\})$ is a $K_i$-feasible cut in $N_t$, $h_i(t) \le \max\{l(u) : u \in input(t)\}$. According to Lemma 2, $\max\{l(u) : u \in input(t)\} \le l(t) \le h_i(t) + d_i$. Therefore, we can derive the range of $h_i(t)$ to be (3).
Based on (3), similar to the approach in [7] for homogeneous LUT mapping with the arbitrary fixed net delay model, we construct a sorted height array, denoted $H$,$^8$ which includes all the distinct labels that are in the range of (3). The size of the height array is at most $n$. We can perform a binary search over this sorted height array to determine the minimum height for which a $K_i$-feasible cut is available, and hence $h_i(t)$. Whether $N_t$ has a $K_i$-feasible cut of height $p$ or not can be tested efficiently using the following method [4]. We first apply a network transformation on $N_t$ that collapses all the nodes in $N_t$ with label greater than $p$, together with $t$, into the new sink $t'$. Denote the resulting network as $N'_t$. It is proved in [4] that $N_t$ has a $K_i$-feasible cut of height $p$ if and only if $N'_t$ has a $K_i$-feasible cut. To test whether there is a $K_i$-feasible cut in $N'_t$, we can compute the min-cut in $N'_t$ and check whether its cut size is no more than $K_i$. In order to compute the min-cut in $N'_t$, we apply the approach proposed in [4], which converts the node cut problem on $N'_t$ into a standard edge cut computation problem on a flow network transformed from $N'_t$. The edge cut problem can be solved by the augmenting path algorithm for max-flow min-cut computation. Assuming that the number of edges in $N_t$ is $m$, we can determine in $O(K_i m)$ time whether $N_t$ has a $K_i$-feasible cut of height $p$, and find one if such a cut exists. Since we can find the $K_i$-feasible cut of minimum height in $N_t$ by performing the binary search on the sorted height array whose size is bounded by $n$, we have the following.
Lemma 4: A minimum-height $K_i$-feasible cut in $N_t$ under the heterogeneous LUT-delay model can be found in $O(K_i m \log n)$ time, where $m$ is the number of edges in $N_t$.
For each $i$, we compute $l_i(t)$, and the label of node $t$ can be determined by (1) in $O(\sum_{i=1}^{c} K_i m \log n)$ time, where $m$ is the number of edges in $N_t$. Applying the label computation for each node in $N$ according to the topological order in the labeling phase, we have:
Theorem 1: Given a network with $n$ nodes and $m$ edges, the labeling process can be done in $O(\sum_{i=1}^{c} K_i \cdot n \cdot m \cdot \log n)$ time.
$^8$ Our implementation can extract the sorted height array in $O(n \log n)$ time based on merge sort.
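The labeling phase can be sketched compactly as follows. This is an illustrative sketch rather than the authors' implementation, under several stated assumptions: the network is a dictionary from node to fanin list whose keys are already in topological order; the network is bounded by the smallest LUT size, so the cut separating the sink from its fanins is always feasible; the binary search runs over all distinct predecessor labels rather than only those in the range of Lemma 3; and the node cut test uses node splitting with breadth-first augmenting paths. All function names are hypothetical.

```python
from collections import deque

INF = float("inf")


def max_flow_capped(fanins, cone, sink_side, limit):
    """Number of node-disjoint paths from the PIs of `cone` into the collapsed
    sink, stopped as soon as it exceeds `limit`."""
    cap = {}

    def add_edge(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)                  # residual (reverse) edge

    def v_in(v):
        return "T" if v in sink_side else (v, "in")

    def v_out(v):
        return "T" if v in sink_side else (v, "out")

    for v in cone:
        if v not in sink_side:
            add_edge((v, "in"), (v, "out"), 1)   # unit node capacity
        if not fanins[v]:
            add_edge("S", v_in(v), INF)          # auxiliary source to each PI
        for u in fanins[v]:
            if v_out(u) != v_in(v):              # skip edges inside the sink
                add_edge(v_out(u), v_in(v), INF)
    flow = 0
    while flow <= limit:
        parent = {"S": None}
        queue = deque(["S"])
        while queue and "T" not in parent:
            u = queue.popleft()
            for v, c in cap.get(u, {}).items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if "T" not in parent:
            break
        v = "T"
        while parent[v] is not None:             # push one unit along the path
            u = parent[v]
            cap[u][v] -= 1
            cap[v][u] += 1
            v = u
        flow += 1
    return flow


def cone_of(fanins, t):
    """Node t together with all of its predecessors (the subnetwork N_t)."""
    seen, stack = {t}, [t]
    while stack:
        for u in fanins[stack.pop()]:
            if u not in seen:
                seen.add(u)
                stack.append(u)
    return seen


def hetero_labels(fanins, lut_delay):
    """lut_delay: LUT size K_i -> delay d_i. Returns node -> label."""
    label = {}
    for t in fanins:                             # keys assumed topologically ordered
        if not fanins[t]:
            label[t] = 0.0
            continue
        cone = cone_of(fanins, t)
        heights = sorted({label[u] for u in cone if u != t})
        best = INF
        for k, d in lut_delay.items():
            lo, hi = 0, len(heights) - 1
            while lo < hi:                       # binary search on the cut height
                mid = (lo + hi) // 2
                sink = {t} | {u for u in cone if u != t and label[u] > heights[mid]}
                if max_flow_capped(fanins, cone, sink, k) <= k:
                    hi = mid
                else:
                    lo = mid + 1
            best = min(best, heights[lo] + d)
        label[t] = best
    return label


# Hypothetical 2-bounded network mapped to 2-LUTs (delay 1.0) and 3-LUTs (delay 1.5).
net = {"a": [], "b": [], "c": [], "g1": ["a", "b"], "g2": ["g1", "c"]}
labels = hetero_labels(net, {2: 1.0, 3: 1.5})
print(labels)
```

In this example `g2` receives label 1.5 because a single 3-LUT can cover both gates, whereas mapping with 2-LUTs alone gives it label 2.0.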
B. Mapping Phase
The second phase of the HeteroMap algorithm generates a delay optimal mapping solution based on the cuts computed in the first phase. This phase is similar to that in [4], except that the LUTs generated in our mapping solution can be heterogeneous. We maintain a list $L$ of nodes that have to be implemented by LUTs. Initially, $L$ contains those nodes that have fanouts to the PO nodes. Then, we repeatedly remove a node $v$ from $L$, create the LUT $LUT(v)$ based on the cut computed for $v$ in the labeling phase, and add into $L$ all the nodes in $input(LUT(v))$ whose LUTs have not yet been created. The mapping phase ends when $L$ only contains PI nodes. The entire second phase takes linear time. It is not hard to see that the delay from any PI to the output of $LUT(v)$ is no more than $l(v)$ in the resulting mapping solution. Therefore, the mapping solution is optimal. The nodes that are never added into $L$ do not need to be implemented since they are completely covered by the LUTs implementing other nodes.
The HeteroMap algorithm is summarized in Fig. 5.
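The worklist procedure of the mapping phase can be sketched as follows. The sketch assumes the labeling phase has stored, for every node, the set of nodes covered by its selected LUT (the sink side of its chosen cut, including the node itself); the dictionary layout and function name are hypothetical.

```python
def generate_luts(fanins, output_drivers, covered):
    """fanins: node -> fanin list (PIs have an empty list).
    output_drivers: nodes that drive primary outputs.
    covered[v]: nodes packed into the LUT rooted at v, including v.
    Returns LUT root -> sorted list of LUT inputs."""
    luts = {}
    worklist = list(output_drivers)
    while worklist:
        v = worklist.pop()
        if v in luts or not fanins[v]:           # already created, or a PI
            continue
        inside = covered[v]
        inputs = {u for w in inside for u in fanins[w] if u not in inside}
        luts[v] = sorted(inputs)
        worklist.extend(inputs)                  # LUT inputs must be implemented too
    return luts


# Hypothetical network in which the LUT at g2 covers both g1 and g2.
net = {"a": [], "b": [], "c": [], "g1": ["a", "b"], "g2": ["g1", "c"]}
covered = {"g1": {"g1"}, "g2": {"g1", "g2"}}
luts = generate_luts(net, ["g2"], covered)
print(luts)  # {'g2': ['a', 'b', 'c']}
```

No LUT is generated for `g1` because it is never reached from the worklist: it is completely covered by the LUT rooted at `g2`.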
C. Area Minimization in the HeteroMap Algorithm
The secondary objective of the HeteroMap algorithm is area optimization, which is addressed by maximizing the volume of each cut during the mapping process and by the post-mapping heterogeneous predecessor packing operations. In general, the minimum-height $K_i$-feasible cut $(X, \overline{X})$ computed in the labeling phase is not unique. Intuitively, the larger $vol(X, \overline{X})$, the more nodes we can pack into the $K_i$-LUT $LUT_i(t)$, which leads to fewer LUTs in total.
Therefore, our algorithm maximizes the volume of each cut during the minimum-height $K_i$-feasible cut computation. According to the description in Section III-A, the minimum-height $K_i$-feasible cut computation is reduced to a series of min-cut [13] computations. Therefore, HeteroMap uses the approach proposed in [4], which finds the maximum volume min-cut with the same complexity as that of computing a min-cut. Note, however, that the resulting minimum-height $K_i$-feasible cut may not be the maximum volume minimum-height $K_i$-feasible cut, since only the maximum volume minimum-height min-cut is computed during each cut computation. In order to further reduce the area, HeteroMap extends the predecessor packing operation, which was first introduced in [3] for homogeneous mapping solutions, to minimize the number of LUTs in the heterogeneous mapping solution, where one LUT capacity can be different from another. If a $K_i$-LUT $v$ ($1 \le i \le c$) has a fanin $K_j$-LUT $u$ ($1 \le j \le c$), $u$ has only one fanout, and $|input(u) \cup input(v)| \le K_i$, then $u$ can be merged into $v$ such that the LUT implementing $u$ can be saved. This operation is carried out in the mapping solution of HeteroMap in the reverse topological order from PO to PI, and ends when no such packing operations can be applied.
IV. DELAY-ORIENTED TECHNOLOGY MAPPING FOR HETEROGENEOUS FPGAs WITH BOUNDED RESOURCES
In this section, we study the delay-oriented technology mapping problem for heterogeneous FPGAs with bounded resources, where some of the $R_i$'s have finite values as defined in Section II. Section IV-A analyzes the computational complexity of this problem. To solve the problem, Section IV-B presents a pseudopolynomial time optimal algorithm for trees, while Section IV-C presents two heuristic algorithms for general networks.
A. Complexity of Problem DM-HF-BR
We investigate in this subsection the computational complexity of the DM-HF-BR problem. In order to simplify the description, we assume that there are two types of LUTs in a heterogeneous FPGA, $K_1$-LUTs without resource limitation and $R_2$ $K_2$-LUTs ($R_2$ is a variable), with a delay ratio of $d_1 : d_2$. We first define the decision version of the DM-HF-BR problem.
Problem: Delay-bounded heterogeneous LUT mapping with bounded resources (DB-HF-BR)
Instance: Three integers, $K_1$, $K_2$, and $R_2$, two real numbers, $d_2$ and $D$, and a $K_1$-bounded Boolean network $N$.
Question: Under the heterogeneous LUT-delay model with LUT delays $d_1 = 1$ and $d_2$, is there a mapping solution of $N$ with any number of $K_1$-LUTs and no more than $R_2$ $K_2$-LUTs, which has delay no more than $D$?
We shall first show that the DB-HF-BR problem is NP-complete for . The proof of the NP-completeness for the DB-HF-BR problem is based on the polynomial time transformation from the 3-Satisfiability (3SAT) problem to the DB-HF-BR problem. First, we define the 3SAT problem, which is a well-known NP-complete problem. Question: Is there a truth assignment of the variables in such that ? We shall construct a polynomial time transformation that transforms each instance of 3SAT to an instance of DB-HF-BR. We shall relate the decision of the truth assignment of a Boolean variable in an instance of 3SAT to the decision of using the -LUT implementation on a node in the corresponding network of the DB-HF-BR instance. Since determining the truth assignment is difficult, we can show that determining the -LUT implementation is also difficult. Fig. 7(a) . For the case of 6, another PI is added as another fanin to one of the literal nodes, i.e., , as depicted in Fig. 7(b) . The reason that is constructed in this way is to make sure that it is impossible to pack all the three literal nodes in to using a -LUT while it is always possible to pack two of the three literal nodes in to to form a -LUT (to be shown later).
After connecting subnetworks , with the subnetworks , the network obtained is -bounded when . It is clear that the transformation defined above takes time. The following NP-completeness proof holds for . We define packing and into to form a -LUT as pack , and packing and into to form a -LUT as pack . We now show that it is impossible to pack all the three literal nodes in to using a -LUT. If is an odd number, each of the literal nodes has fanins. If all the three literal nodes are packed into , the total number of fanins for is , which is larger than when . If is an even number, each of the literal nodes has fanins. If all the three literal nodes are packed into , the total number of fanins for is , which is larger than when . Similarly we can show that with the subnetwork constructed for 6 in Fig. 7(b) , it is impossible to pack all the three literal nodes in to . Therefore, it is impossible to pack all the three literal nodes in to using a -LUT for . However, we can always pack two of the three literal nodes in to to form a -LUT, denoted as pack (see Fig. 7 ). By linking the variable assignment of 1 for the 3SAT instance with pack , and linking the assignment of 0 with pack , we can show the following.
Theorem 2:
is satisfiable if and only if has a mapping solution of delay using -LUTs and no more than -LUTs. The proof is given in the Appendix to this paper. Corollary 1: DB-HF-BR is NP-complete for . Proof: DB-HF-BR is in NP because given any mapping solution with -LUTs and -LUTs, we can easily verify if its delay is bounded by . Moreover, the transformation from 3SAT instance to DB-HF-BR instance takes polynomial time. Finally, according to Theorem 2, the 3SAT question has an YES answer if and only if the DB-HF-BR question has an YES answer. Therefore, the NP-completeness of 3SAT implies that DB-HF-BR is NP-complete.
Corollary 2: The DM-HF-BR problem is NP-hard for .
Note that the construction of does not apply when , therefore, the complexity of the problem is still open for .
B. Delay Optimal Mapping for Trees
Although the DM-HF-BR problem is NP-hard for general DAGs, we shall show in this section that it can be solved optimally in pseudopolynomial time for trees using the dynamic programming technique.
Assume that there are two types of LUTs in a heterogeneous FPGA, -LUT without resource limitation and -LUTs , with delay ratio of . Given a tree , we want to compute the mapping solution for with minimum mapping delay under the heterogeneous LUT-delay model by using the available heterogeneous LUT resources. The algorithm is based on the cut generation for trees. Assume that the root of has fanin nodes . Let denote the subtree in rooted at . Clearly, any cut of size in induces an -cut of , with , and vice versa. Let denote the set of cuts of size in , and define . It was shown in [5] that (4) where the union is on all combinations of such that . Therefore, all the cuts of size in a tree can be generated based on the recursive equation (4), and the number of cuts generated according to (4) is bounded by a constant depending only on , which is the th Catalan number [13] , denoted , where . The total number of -feasible cuts in a tree is thus bounded by . Let be the minimum delay of node with -LUTs used in the mapping solution of , we want to compute for each , for each node in from leaves to root in topological order using dynamic programming. The topological order guarantees that every node is processed after all of its predecessors have been processed. For each node , we first generate all -feasible cuts in , which include all -feasible cuts in as well. Note that node will be implemented either by -LUT or -LUT. We first assume that is implemented by -LUT. For each -feasible cut in , we enumerate all the -LUTs' distributions among the trees rooted at the cut nodes of this -feasible cut , then obtain the minimum delay of using this -feasible cut by the following formula:
Through checking all the -feasible cuts in , we can compute for each , which is the minimum delay of node with -LUTs used in the mapping solution of if is implemented by an -LUT. In a similar way, by assuming that is implemented by -LUT, we check eachfeasible cut in and enumerate all the -LUTs' distributions among the trees rooted at the cut nodes of this -feasible cut , then obtain the minimum delay of using this -feasible cut by the following formula:
Through checking all the -feasible cuts in , we can compute for each , which is the minimum delay of node with -LUTs used in the mapping solution of if is implemented by an -LUT. Therefore, for each . For the root gives the minimum mapping delay of using -LUTs and no more than -LUTs.
Based on the above description, for every node in and every -feasible cut in , it takes time to compute all as in order to compute each , maximally time is needed to enumerate all the -LUTs' distributions among the trees rooted at the cut nodes of this -feasible cut ; for every -feasible cut in , it takes time to compute all . Since the total number of -feasible cuts in the tree rooted at every node is bounded by , where is the th Catalan number, the complexity of the above algorithm is , where is the number of nodes in .
It is not hard to see that this algorithm can be easily extended for heterogeneous FPGAs with types of LUTs of which have resource limitations, specified by -LUTs, -LUTs and -LUTs, respectively. We shall then compute , for each node in tree , instead of , and the complexity becomes , where . This algorithm is considered to be polynomial if the sizes of the heterogeneous LUTs are taken as the constants. Therefore, the DM-HF-BR problem can be solved optimally in pseudopolynomial time for trees. This result, however, is mainly of theoretical interest as explained in Section IV-C.
C. Delay-Oriented Mapping for DAGs
Most combinational circuits are general DAGs instead of trees. If we decompose a general DAG into independent trees before mapping, then in order to solve the DM-HF-BR problem we have to try all possible distributions of the LUTs with bounded resources among the independent trees, which results in very high complexity. Therefore, we do not intend to use the optimal tree mapping algorithm for general DAGs. Instead, we develop two efficient heuristics, BinaryHM (Section IV-C1) and CN-HM (Section IV-C2), to solve this problem. Both use the HeteroMap algorithm presented in Section III to compute a delay-optimal heterogeneous LUT mapping solution without resource constraints under a given heterogeneous LUT delay ratio.
1) The BinaryHM Algorithm: Given a delay ratio of $K_2$-LUT versus $K_1$-LUT, HeteroMap uses as many $K_2$-LUTs as necessary to achieve the minimum delay for every node in $N$. However, if we increase the delay ratio of $K_2$-LUT versus $K_1$-LUT from the original delay ratio all the way to a sufficiently large upper limit, we expect HeteroMap to generate mapping solutions using fewer and fewer $K_2$-LUTs, as HeteroMap will tend to use $K_2$-LUTs only on the nodes which gain more delay reduction from the $K_2$-LUT implementation over the $K_1$-LUT implementation. When the delay ratio reaches the upper limit, HeteroMap will produce exactly the same mapping solution as FlowMap, as using any $K_2$-LUT will not lead to a better mapping solution. For example, with $K_1 = 4$ and $K_2 = 11$, Fig. 8 shows how the number of 11-LUTs used in the mapping solution of HeteroMap decreases as the delay ratio of an 11-LUT versus a 4-LUT increases. Assuming that the original delay ratio of an 11-LUT versus a 4-LUT is four, the mapping delay achieved by HeteroMap using this delay ratio is a lower bound of the minimum mapping delay achievable under the resource bound. Fig. 9 shows how the overall circuit delay obtained by HeteroMap (the circuit delay is computed by using the real delay ratio) changes when we increase the delay ratio, which results in a decrease of the number of 11-LUTs used in the mapping solution.
Therefore, by doing a binary search on the delay ratio of -LUT versus -LUT within the range , BinaryHM will eventually find a mapping solution which uses no more than -LUTs and has the minimum mapping delay [in the range specified by (7)].
For a netlist with nodes and edges, the length of the delay ratio range is at most . Thus, if the granularity of the binary search over the delay ratio of -LUT versus -LUT is selected to be , BinaryHM will go through the HeteroMap algorithm for times. Therefore, the complexity of the BinaryHM algorithm is . For the experimental results reported in Section V-B, the value of is set to be 0.1.
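The binary search performed by BinaryHM can be sketched as follows. This is a hedged illustration, not the authors' code: the callback `run_mapper`, its return convention, and the toy mapper used to exercise it are all hypothetical. The sketch assumes that the number of large LUTs used does not increase as the ratio grows (the text expects this trend but the search remains a heuristic), and that the solution at the upper end of the ratio range respects the resource bound.

```python
def binary_hm(run_mapper, ratio_lo, ratio_hi, max_big_luts, eps=0.1):
    """run_mapper(ratio) is assumed to run an unconstrained mapper with the given
    large-LUT/small-LUT delay ratio and to return (large LUTs used, circuit delay
    evaluated with the real ratio). Returns (ratio, large LUTs used, delay)."""
    used, delay = run_mapper(ratio_lo)
    if used <= max_big_luts:                     # original ratio already feasible
        return ratio_lo, used, delay
    used, delay = run_mapper(ratio_hi)           # assumed to respect the bound
    best = (ratio_hi, used, delay)
    lo, hi = ratio_lo, ratio_hi
    while hi - lo > eps:                         # eps is the search granularity
        mid = (lo + hi) / 2.0
        used, delay = run_mapper(mid)
        if used <= max_big_luts:
            if delay < best[2]:
                best = (mid, used, delay)
            hi = mid                             # feasible: try a smaller ratio
        else:
            lo = mid                             # infeasible: penalize large LUTs more
    return best


def toy_mapper(ratio):
    """Hypothetical stand-in for a real mapper: fewer large LUTs as the ratio
    grows, and a longer real delay when fewer large LUTs are used."""
    used = max(0, 10 - int(ratio))
    return used, 20.0 - used


result = binary_hm(toy_mapper, ratio_lo=4.0, ratio_hi=10.0, max_big_luts=3)
print(result)  # (7.0, 3, 17.0)
```

Each probe costs one full run of the underlying mapper, so the number of probes, roughly the logarithm of the range length divided by the granularity, multiplies the mapper's complexity.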
2) The CN-HM Algorithm: Our second heuristic, named CN-HM, is a post-mapping approach. Given an original unmapped network, FlowMap is first applied to map into a -LUT netlist of the minimum mapping delay. Then in the post-mapping procedure, CN-HM further minimizes the circuit delay using no more than -LUTs by focusing on the most delay-critical nodes in for each circuit delay target. Let be the minimum mapping delay of . Similar to BinaryHM, we first determine the range of . Let be the current delay of , in which each node is a -LUT. Let be the minimum mapping delay of if there is no constraint on the number of -LUTs used on . is obtained by performing a labeling procedure, similar to that in HeteroMap, on , except that for each node in , the possible -LUT implementation on is itself since CN-HM is a post-mapping approach. Clearly, is the upper bound of . is the lower bound of , as is obtained by using as many -LUTs as necessary to minimize the delay for every node in . Therefore, we have (8) CN-HM performs a binary search over the interval . For each circuit delay target during the binary search, CN-HM identifies all the critical nodes in , and some of these critical nodes' delays have to be reduced in order to achieve the overall circuit delay target . These critical nodes altogether form a critical graph , with these critical nodes as the vertices and their interconnections in (critical paths) as edges. The concept of critical graph (or network) has been used extensively in speed-up, a well-known timing optimization algorithm for combinational logic [18] . The critical graph represents the most timing critical portions of the logic. By considering the critical graph for each circuit delay target, CN-HM can avoid consuming the limited number of -LUTs on noncritical nodes during the circuit delay minimization.
Observation 1: has a mapping delay of no more than using no more than -LUTs only if the size of the minimum cut in is no more than . CN-HM first checks whether a delay target can or cannot be obtained according to the necessary condition depicted in Observation 1 before any further operation is performed. If the necessary condition is not met, CN-HM will locate a less aggressive delay target through the binary search. If the condition is met, CN-HM will perform a HeteroMap labeling procedure on the critical nodes in . Then, similar to BinaryHM, CN-HM performs a binary search on the delay ratio of -LUT versus -LUT with the range from the original delay ratio of to , and applies HeteroMap with the given delay ratio to check whether no more than -LUTs can be used in such that the delay of is bounded by . For a mapped netlist with nodes and edges, the delay target range is at most . If the granularity of the binary search over the circuit delay target is set to be and the granularity of the binary search over the delay ratio of -LUT versus -LUT is set to be , the complexity of the CN-HM algorithm will be . For the experimental results reported in Section V-B, is set to be 1.0 and is set to be 0.1. In summary, BinaryHM operates on the original unmapped circuit and performs a binary search on the delay ratio of -LUT versus -LUT such that HeteroMap can eventually obtain the mapping solution with a feasible number of -LUTs used and the circuit mapping delay minimized. Instead, CN-HM takes the mapped circuit with each node as a -LUT and performs a binary search on the minimum delay of the circuit to get a delay target at each time. For each delay target, CN-HM identifies critical nodes and again performs a binary search on the delay ratio of -LUT versus -LUT, such that HeteroMap can check whether the delay target is feasible or not.
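The outer loop of CN-HM, as summarized above, can be sketched as a binary search over the delay target in which a cheap necessary condition is tested before the expensive feasibility check. This is a simplified illustration under stated assumptions: `passes_cut_check` stands in for the minimum-cut test of Observation 1 and `fits_resource_bound` for the inner delay-ratio search with HeteroMap; both callables and the function name are hypothetical, and feasibility is assumed to be monotone in the target.

```python
from typing import Callable, List


def search_delay_target(
    d_low: float,
    d_high: float,
    passes_cut_check: Callable[[float], bool],
    fits_resource_bound: Callable[[float], bool],
    step: float = 1.0,
) -> float:
    """Smallest accepted delay target in [d_low, d_high], to within `step`.

    Assumption: `d_high` is always achievable, since it is the delay of the
    mapping that is already available before post-mapping starts.
    """
    best = d_high
    lo, hi = d_low, d_high
    while hi - lo > step:
        target = (lo + hi) / 2.0
        # The cheap necessary condition runs first; the expensive check is
        # only evaluated when the necessary condition holds.
        if passes_cut_check(target) and fits_resource_bound(target):
            best = hi = target  # target met: try a more aggressive one
        else:
            lo = target  # target not met: relax it
    return best


# Toy oracles (hypothetical): the cut test passes from 6.5 upward, the full
# resource-bounded check from 7.0 upward.
expensive_calls: List[float] = []


def toy_cut_check(target: float) -> bool:
    return target >= 6.5


def toy_resource_check(target: float) -> bool:
    expensive_calls.append(target)
    return target >= 7.0


best_target = search_delay_target(4.0, 12.0, toy_cut_check, toy_resource_check)
```

In the toy run the targets tried are 8, 6, and 7; the target 6 is rejected by the cut test alone, so the expensive check runs only twice.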
V. EXPERIMENTAL RESULTS AND COMPARATIVE STUDY
We have implemented HeteroMap, BinaryHM, and CN-HM in the C language on the SUN ULTRA SPARC Workstation and integrated them into the RASP system developed at the University of California at Los Angeles (UCLA) [8] . We present the experimental results of HeteroMap on heterogeneous FPGAs without bounded resources in Section V-A and the experimental results of BinaryHM and CN-HM on heterogeneous FPGAs under resource constraint in Section V-B.
A. Delay Optimal Technology Mapping for Heterogeneous FPGAs Without Bounded Resources
We first compare FlowMap [4] and HeteroMap on XC4000 series FPGAs, which can implement circuits with 4-LUTs and 5-LUTs with delays equal to 1.0 and 1.5, respectively. For FlowMap, we set K = 5 such that after mapping, the five-input nodes will be implemented in 5-LUTs, and the remaining nodes will be implemented in 4-LUTs. The mapping comparison results between FlowMap and HeteroMap are summarized in Table I. The " " in columns 2 and 7 is the critical path delay in the mapped network under the heterogeneous LUT-delay model. The PLB number in columns 3 and 8 comes from packing every two 4-LUTs into one PLB and leaving all 5-LUTs as independent PLBs. "A-mean" and "G-mean" represent arithmetic mean and geometric mean, respectively. Table I shows that HeteroMap reduces 19% of the mapping delays and 17% of the PLB numbers over FlowMap.

TABLE I. THE COMPARISON BETWEEN FLOWMAP AND HETEROMAP ON XC4000 SERIES FPGAS
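The PLB-count rule and the heterogeneous LUT-delay model used in this comparison amount to simple arithmetic, sketched below. The helper names `plb_count` and `path_delay` are hypothetical, and the sketch assumes any two 4-LUTs can share a PLB, ignoring packing constraints that a real device may impose.

```python
import math
from typing import Dict, Iterable

# LUT delays stated in the text for the XC4000 experiment.
LUT_DELAY: Dict[int, float] = {4: 1.0, 5: 1.5}


def plb_count(num_4lut: int, num_5lut: int) -> int:
    """Two 4-LUTs share one PLB; each 5-LUT occupies a PLB of its own."""
    return math.ceil(num_4lut / 2) + num_5lut


def path_delay(lut_sizes: Iterable[int]) -> float:
    """Delay of one path: the sum of the delays of the LUTs along it."""
    return sum(LUT_DELAY[k] for k in lut_sizes)


plbs = plb_count(7, 3)          # 4 PLBs for the 4-LUTs plus 3 for the 5-LUTs
delay = path_delay([4, 5, 4])   # 1.0 + 1.5 + 1.0
```

The mapping delay of a whole network is then the largest such path delay over all paths from primary inputs to primary outputs.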
Match4K [8] is an intelligent post-mapping multistep matching heuristic for the XC4000 device family. One of its features is to implement a 5-LUT using a partial XC4000 PLB, which can significantly reduce the PLB number in the mapping solution. To take advantage of the Match4K algorithm in XC4000 mapping, we also compare FlowMap followed by Match4K with HeteroMap followed by Match4K, and perform layout on the mapping solutions using the Xilinx Alliance 1.4 FPGA development system. The comparison results are also summarized in Table I. The post-layout delays are reported by the Xilinx Alliance 1.4 FPGA development system. After Match4K, HeteroMap can still reduce 6% of the post-layout delays (" ") and 8% of the PLB numbers (" ") over FlowMap. Although the run time of HeteroMap is slightly higher than that of FlowMap, the overall compilation time is dominated by the layout design performed by the Alliance 1.4 system. Furthermore, if we impose the same set of timing constraints on both the FlowMap and HeteroMap mapping solutions during the layout design, the Alliance 1.4 system needs three to four times more CPU time to meet the timing constraints on the FlowMap mapping solutions than on the ones generated by HeteroMap.
A heterogeneous FPGA could consist of three or more types of logic blocks. In order to show more clearly the benefits of exploiting heterogeneity in technology mapping, we test FlowMap and HeteroMap on a hypothetical heterogeneous FPGA architecture which consists of three types of LUTs with input sizes four, five, and six, and the delays 1.0, 1.25, and 1.5, respectively. For FlowMap, we set K = 6 such that after mapping each six-input node is implemented by one 6-LUT, each five-input node by one 5-LUT, and all remaining nodes by 4-LUTs. The results are shown in Table II. In order to compare the area in the mapping solution, let be the area ratio between LUTs of consecutive input sizes. In Table II we set this ratio to 2 based on the SRAM cell count, which assumes that the area of each 4-LUT, 5-LUT and 6-LUT is one, two, and four, respectively. Under this assumption, HeteroMap reduces 11% of the mapping delays and 20% of the area, when compared with FlowMap. In general, the area ratio of may not be 2 due to the additional logic and interconnects in a PLB. To make a more accurate area analysis, we perform a sequence of tests with and plot the results in Fig. 10 to show the area comparison between FlowMap and HeteroMap. HeteroMap outperforms FlowMap on area when , and is always better than FlowMap in terms of the delay.
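The area accounting used for Table II can be sketched as follows. The sketch assumes a single ratio applies between consecutive LUT sizes, so that a 6-LUT costs the square of the ratio; with a ratio of 2 this reproduces the stated areas of one, two, and four. The names `mapping_area`, `sram_cells`, and the parameter `ratio` are hypothetical.

```python
def sram_cells(k: int) -> int:
    """A k-input LUT stores one bit per input combination."""
    return 2 ** k


def mapping_area(n4: int, n5: int, n6: int, ratio: float = 2.0) -> float:
    """Total area in units of one 4-LUT.

    Assumption: area(5-LUT) = ratio and area(6-LUT) = ratio ** 2.
    """
    return n4 * 1.0 + n5 * ratio + n6 * ratio ** 2


# SRAM cell counts double with each extra input, which motivates ratio = 2.
cells = [sram_cells(k) for k in (4, 5, 6)]        # 16, 32, 64

area_sram_model = mapping_area(10, 4, 2)           # 10 + 8 + 8
area_smaller_ratio = mapping_area(10, 4, 2, 1.5)   # 10 + 6 + 4.5
```

Sweeping `ratio` over a range and evaluating `mapping_area` for two competing mappings is one way to locate the point at which one mapping becomes smaller than the other.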
B. Delay-Oriented Technology Mapping for FPGAs with EMBs
We tested BinaryHM and CN-HM on MCNC benchmarks on the Altera FLEX10K device family [2], which can be regarded as heterogeneous FPGAs with 4-LUTs and a limited number of 11-LUTs. The technology mapping comparison results are summarized in Table III, where the BinaryHM and CN-HM algorithms are compared with FlowMap [4], which uses only 4-LUTs, and with HeteroMap, which uses both 4-LUTs and 11-LUTs and whose number of 11-LUTs could exceed the resource bounds. In FLEX10K devices, the delay ratio between 4-LUT and 11-LUT is 1:4. For BinaryHM and CN-HM, the number of 11-LUTs (EMBs) available (" ") is determined by the smallest FLEX10K device into which the circuit can be fitted. The experimental results show that compared with FlowMap using only 4-LUTs, both BinaryHM and CN-HM can reduce more than 20% of the circuit mapping delays (" ") and 27% of the 4-LUT area ("4-LUT") by making efficient use of the available heterogeneous LUTs. Moreover, in order to meet the resource constraints, BinaryHM and CN-HM consume far fewer 11-LUTs (" ") than HeteroMap does. Although for some circuits not all the available 11-LUTs are used for delay minimization in BinaryHM and CN-HM, they can be used later on by EMB Pack, an algorithm proposed in [9], to further minimize the circuit area while maintaining the circuit delay. As an example, Table III also shows the circuit area and the number of EMBs eventually used by CN-HM followed by EMB Pack ("CN-HM EP") as the postprocessing. The number of 4-LUTs in the mapping solutions obtained by the Altera FPGA development system MAX+PLUS II 9.01 [2] is shown in the last column of Table III ("MAX"). From Table III, we clearly see that through the effective use of EMBs under the resource constraints, both BinaryHM and CN-HM reduce the delay and area of the mapping solutions considerably over state-of-the-art academic algorithms and commercial tools.
The CPU time comparison is summarized in Table IV. From Tables III and IV we can see that both BinaryHM and CN-HM are fairly efficient, using less than 40 min of CPU time for all 13 benchmarks, which range from 200 gates to 3000 gates (four of them have over 2000 gates).
In order to show the effectiveness of our algorithms on the final circuit layout delay, we use the FPGA development system MAX+PLUS II 9.01 [2] to perform layout on all the mapping solutions obtained from FlowMap, MAX+PLUS II itself, BinaryHM and CN-HM, and summarize the results in Table V. The experimental results show that both BinaryHM and CN-HM are able to reduce 12% and 6% of the circuit layout delays (" ") over FlowMap and MAX+PLUS II, respectively.
VI. CONCLUSIONS AND FUTURE WORK
In this paper, we studied the technology mapping problem for heterogeneous FPGAs with or without bounded resources under the objective of delay optimization. We presented the first polynomial-time delay optimal technology mapping algorithm, named HeteroMap, for heterogeneous FPGAs without bounded resources. We further showed that the delay minimization technology mapping problem for heterogeneous FPGAs with bounded resources is NP-hard for general networks, but can be solved optimally in pseudopolynomial time for trees. We then presented two heuristic algorithms, named BinaryHM and CN-HM, to solve this problem for general networks. HeteroMap, BinaryHM and CN-HM produced favorable results on MCNC benchmarks on commercial heterogeneous FPGAs.
From the experimental results of the HeteroMap algorithm for XC4000 series FPGAs, we can see that the post-layout delay is not reduced as dramatically as the mapping delay over the FlowMap algorithm. One of the main reasons is that the interconnection delay, in addition to the LUT delay, ought to be appropriately modeled and optimized during mapping in order to achieve a fully optimized FPGA design. This will be an important future direction in our research. Also notice in Table III that our algorithms are unable to use the available EMB resources for delay minimization on circuits des and ex5p. One of the reasons is that these two circuits are large in size (compared to the other benchmarks we use) but small in depth, which makes it difficult for our algorithms to minimize their delays. We plan to generalize the algorithms to utilize the multiple-output configurations of EMBs. This is expected to be more effective in terms of delay reduction using EMBs for "wide" circuits like des and ex5p. We believe that in order to obtain high density and high performance, heterogeneous FPGAs with or without resource constraints are the future trend of the FPGA architecture development. We shall continue working on technology mapping for heterogeneous FPGAs and extend our algorithms to handle more general delay models in heterogeneous FPGA designs, which will take both interconnection delays and the LUT delays into consideration. We also expect to use our mapping algorithms to evaluate different types of heterogeneous FPGA architectures to achieve better performance and area utilization.

TABLE III. MAPPING COMPARISON AMONG FLOWMAP, HETEROMAP, BINARYHM, AND CN-HM ON FLEX10K DEVICE FAMILY

TABLE IV. CPU COMPARISON AMONG FLOWMAP, HETEROMAP, BINARYHM, AND CN-HM ON FLEX10K DEVICE FAMILY
APPENDIX NP-COMPLETENESS PROOF OF THE DB-HF-BR PROBLEM
We start with a lemma about the transformation from a satisfiable 3SAT instance to a delay-bounded DB-HF-BR instance. contains at least one literal with value one, corresponding to a literal node in subnetwork whose variable node in some subnetwork is packed with its predecessors, and the delay for this variable node is . Therefore, pack , is performed to pack the literal nodes (at most two), which have delays larger than since their variable nodes' delays are larger than , into to reduce the delay of to no more than . We now compute the delay of the mapping solution. There are three types of paths from the PI nodes to the PO nodes in .
• Paths entirely inside . Since exactly one of pack and pack is performed, the delay of is either or , both of which are no more than . Therefore, the total delay along the path is no more than .
• Paths entirely inside . No such paths have delay more than .
• Paths from to . According to the above analysis, pack , is performed to pack the literal nodes (at most two), which have delays larger than , into to reduce the delay of to no more than . Therefore, the delay of is no more than .
In summary, our mapping solution of , constructed based on the truth assignment that satisfies , has maximum delay of . In order to derive a truth assignment that satisfies from a delay-bounded mapping solution of using -LUTs and no more than -LUTs, we first analyze the characteristics of such a mapping solution.
Lemma 6: In a mapping solution of with delay bound of using -LUTs and no more than -LUTs, exactly one -LUT is used on only one of and for each . Proof: First, in order to maintain the delay bound of over each , there must be at least one -LUT used in each . As the mapping solution uses no more than -LUTs, there is exactly one -LUT used in each , and the only possibility to use -LUT is on one of and . An interesting observation is that if for some , neither pack nor pack is performed, which means that one -LUT is used for or instead of or , then the delay of will be . Therefore, no matter how the pack operation is performed on the corresponding clause subnetworks which have as their variable node, the delay of the POs in those subnetworks will be no less than , which is larger than . Therefore, if for some , neither pack nor pack is performed, does not have any fanout, implying that is not in any clause of .
Lemma 7: In a mapping solution of with delay bound of using -LUTs and no more than -LUTs, for each , at least one of the operations pack pack and pack must be performed, where and are the variable nodes for clause .
Proof: If for some clause , none of the above operations are performed, all the variable nodes of this clause will have delays larger than such that all the three literal nodes have delays larger than . The delay of , hence, will be larger than , no matter how pack is performed. Therefore, at least one of these operations must be performed.
According to Lemmas 6 and 7, if a mapping solution using -LUTs and -LUTs satisfies the delay bound of , it will be similar to the one we constructed in the proof of Lemma 5.
Lemma 8: If has a mapping solution of delay using -LUTs and no more than -LUTs, then is satisfiable. Proof: The truth assignment for is constructed as follows. For each variable of , if pack is performed in the mapping solution, we assign ; otherwise we assign . For any clause , according to Lemma 7, there is at least one variable node, on which the pack operation is performed and whose delay is , which implies that is satisfied. Therefore, every clause of is satisfied by the assignment.
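The last step of the proof, reading a truth assignment off the packing decisions and confirming that every clause is satisfied, can be checked mechanically on small instances. The sketch below is only an illustration: the polarity convention (a variable is set to true when the pack operation is performed on its positive side) is an assumption, since the exact assignment rule is not legible in this copy, and the names `assignment_from_packing` and `satisfies` are hypothetical.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, bool]   # (variable, polarity); ("x1", False) means NOT x1
Clause = List[Literal]


def assignment_from_packing(positive_side_packed: Dict[str, bool]) -> Dict[str, bool]:
    """Assumed convention: a variable is True iff its positive side is packed."""
    return {var: packed for var, packed in positive_side_packed.items()}


def satisfies(clauses: List[Clause], assignment: Dict[str, bool]) -> bool:
    """A 3SAT formula holds when every clause has at least one true literal."""
    return all(
        any(assignment[var] == polarity for var, polarity in clause)
        for clause in clauses
    )


# Toy formula: (x1 OR NOT x2 OR x3) AND (NOT x1 OR x2 OR x3).
toy_clauses: List[Clause] = [
    [("x1", True), ("x2", False), ("x3", True)],
    [("x1", False), ("x2", True), ("x3", True)],
]
good = assignment_from_packing({"x1": True, "x2": True, "x3": False})
bad = assignment_from_packing({"x1": True, "x2": False, "x3": False})
good_ok = satisfies(toy_clauses, good)
bad_ok = satisfies(toy_clauses, bad)
```

Here the first packing yields a satisfying assignment, while the second leaves the second clause without a true literal, which corresponds to a clause subnetwork whose delay bound cannot be met.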
