I. INTRODUCTION
THE FIELD programmable gate array (FPGA) was introduced in the mid-1980s as an alternative for the implementation of application-specific integrated circuits (ASICs). In contrast to the cell library technology and the mask-programmable gate array technology for ASICs, an FPGA does not need to go through the fabrication process for circuit implementation and is field programmable and often field reprogrammable. Although the FPGA in general has a lower gate density and slower circuit speed, its advantages of programmability, shorter design turnaround time, and lower initial nonrecurring engineering cost (good for low to medium volume production) often offset its disadvantages. A wide range of applications has been developed using FPGAs, including fast ASIC implementation, rapid system prototyping, logic emulation, and reconfigurable computing.
FPGAs consist of three kinds of programmable elements: programmable logic blocks (PLBs), routing resources, and input-output (I/O) blocks. Each logic block contains combinational components such as multiplexers (MUXs), simple gates (e.g., AND and OR), programmable lookup tables (LUTs), and sequential components such as flip-flops. Routing resources include segmented interconnects and switching blocks. The segmented interconnects connect to the inputs and outputs of logic blocks while the switching blocks link the segments to form long routing tracks to implement routing topology. The I/O blocks can be programmed to become the primary inputs (PIs) or primary outputs (POs) of the circuits on FPGAs.
LUTs are the basic logic blocks in many FPGAs today. A $K$-input LUT ($K$-LUT) consists of $2^K$ static random access memory (SRAM) cells that can store the truth table of an arbitrary $K$-input function. In many FPGAs, small LUTs are connected by fast local connections to form a PLB for implementation flexibility, better performance, and better utilization of silicon area. In contrast, some FPGAs use MUX-based logic blocks or product-term-based logic blocks. These blocks, although having more than $K$ input ports, cannot guarantee an implementation for an arbitrary $K$-input function. New universal logic module (ULM)-based FPGA logic blocks [34] were proposed for better coverage of $K$-input functions, but the coverage was still incomplete (99% of four-input functions using an eight-input ULM). Since the LUT is widely used in today's major FPGAs and is a true ULM for functions of its input size, we focus on implementing functions using LUT-based PLBs in this paper.
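To make the truth-table view of a LUT concrete, the following is a minimal Python sketch of a $K$-input LUT modeled as a list of $2^K$ one-bit cells. The class name `LUT`, the address ordering (first input as most significant bit), and the majority-gate example are illustrative assumptions and are not taken from any FPGA vendor's documentation.

```python
from itertools import product

class LUT:
    """Toy model of a K-input lookup table: 2**K one-bit cells."""

    def __init__(self, k, func):
        self.k = k
        # "Program" the cells by tabulating func over all 2**k input vectors.
        # product() enumerates vectors with the first input most significant.
        self.cells = [int(bool(func(*bits))) for bits in product((0, 1), repeat=k)]

    def __call__(self, *bits):
        assert len(bits) == self.k
        index = 0
        for b in bits:
            index = (index << 1) | (b & 1)  # build the cell address
        return self.cells[index]

# Any 3-input function fits in a 3-LUT; here, a majority gate (assumed example).
maj = LUT(3, lambda a, b, c: (a & b) | (a & c) | (b & c))
```

Because the table has one cell per input vector, input permutation, input negation, and output inversion only change which bits are stored, not whether the function fits.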
A PLB can often implement one arbitrary $k$-input function, where $k$ is determined by the PLB architecture, or some wide function of more than $k$ inputs. Unfortunately, it is generally a difficult problem to determine if an arbitrary given wide function can be implemented by a PLB. This is called the Boolean matching for PLB problem. Most existing technology mapping algorithms first produce a $K$-LUT mapping solution, then pack LUTs into PLBs [9], [10], [18], [19], [21], [27], [30], [31], [39]. A comprehensive survey of recent FPGA technology mapping approaches can be found in [11].
In this paper, we study Boolean matching for PLB problems. We focus on two classes of (LUT-based) PLBs: PLB1 and PLB2, as shown in Fig. 1. Each PLB contains two LUTs, F and G, and a third component H. For PLB1, the 3-LUT H takes as inputs the outputs of F and G and an external signal. For PLB2, the MUX H selects the output of either F or G depending on the value of an external selection signal. The sizes of LUTs F and G and the presence of the external line are architecture parameters that may vary from one PLB to another. Therefore, we represent PLB1 by three parameters, where the first two parameters are the sizes of LUTs F and G, and the third indicates whether the external line exists. We represent PLB2 similarly. Under this notation, particular parameter values give the logic block of Xilinx XC4K series FPGAs [38], which is called the configurable logic block (CLB), the logic block of Lucent's ORCA series FPGAs [33], which is called the programmable function unit, and the Xilinx XC5200 CLB. When there is no confusion in the context, we refer to the two PLBs as PLB1 and PLB2.
Boolean matching for PLBs may lead to significant reduction on mapping area and circuit delay. For example, consider a six-variable function (in the Xilinx test suite used in Section VI-A), which is represented by the following sum-of-product form:
We compare three implementations of this function produced by Chortle-crf [19], FlowMap [9], and Boolean matching algorithms, using XC4K CLBs. For the first two implementations, we applied the optimization script rugged in the sequential circuit synthesis system SIS [32], mapped the resulting gate-level network using the area-oriented mapper Chortle-crf and the depth-optimal mapper FlowMap, respectively, and packed the resulting LUTs into CLBs using the efficient CLB packing procedures in [15]. The CLB networks that result from Chortle-crf and FlowMap have three levels with seven and six CLBs, respectively. However, a Boolean matching approach can obtain a single-CLB implementation of the function (Fig. 2) with each LUT implementing the following function:
In this implementation, two variables are bridged inputs of a pair of LUTs (i.e., each is shared by both LUTs) and a third variable is a bridged input of a pair of LUTs. Prior to this paper, it was an open problem how to perform Boolean matching for PLBs, especially when bridged inputs are taken into consideration.
Most existing Boolean matching approaches are for ASIC design synthesis using cell libraries. A good survey can be found in [3]. Very few are targeted for LUT-based logic blocks. Boolean matching approaches were proposed in [4] and [40] for Actel's MUX-based FPGAs [1]. Their approaches cannot be applied to LUT-based PLBs directly. Mapping algorithms targeted for PLBs were proposed in [8], where linear programming was employed to compute LUT covers and PLB packings simultaneously. Functional decomposition-based mapping approaches [20], [24], [26], [37] were proposed for LUT network synthesis. None of them targeted the implementation of wide functions using PLBs. A recent work [29] studied bidecomposition of Boolean functions and applied the results to Boolean matching for some LUT-based PLBs. The results were limited. For example, the researchers were unable to solve the Boolean matching problem for the XC4K CLB completely.
In this paper, we present new Boolean matching methods for PLBs. Our methods are based on classical and new functional decomposition techniques and provide a more general solution to the Boolean matching problem for LUT-based PLBs. For example, our results give exact solutions for matching functions to the XC4K CLBs. We apply our techniques to quantitative evaluation of PLB architectures (in terms of logic implementation capability) as well as to technology mapping for FPGAs that are widely used today. Our Boolean matching approaches for PLB1 and PLB2 may be extended to other LUT-based PLBs.
This paper is organized as follows. Section II formulates the Boolean matching problem. After the introduction of classical and recent functional decomposition results in Section III, Boolean matching methods for PLB1 and PLB2 are presented in Sections IV and V, respectively. Section VI reports Boolean matching and architecture evaluation experimental results. In Section VII, we present technology mapping algorithms that employ our Boolean matching methods and report the experimental results in Section VIII. Section IX concludes the paper. Preliminary results of this study were presented in [13] and [14] .
II. PROBLEM FORMULATION
Given a multilevel network of logic gates, combinational logic synthesis transforms the given network into a network of PLBs. This transformation usually includes two major steps: 1) logic optimization and 2) technology mapping. Logic optimization transforms the given network into an equivalent network that is suitable for mapping into PLBs, e.g., into a network of fewer gates and/or smaller gates. Technology mapping then transforms the resulting network into a PLB network of minimal cost, where the cost could be network area or delay. Conventional technology mapping usually includes covering the gate-level network with LUTs and packing LUTs into PLBs. The FlowMap and the Chortle-crf algorithms are two widely used technology mapping algorithms. However, as shown in the previous section, they could generate suboptimal solutions compared to Boolean matching approaches. In this paper, we focus on the Boolean matching for PLBs and then apply our results to technology mapping.
For each LUT-based PLB architecture, we define the characteristic number of the PLB to be the largest number $k$ such that any function of $k$ or fewer variables is realizable by the PLB. Functions of more than $k$ variables are called wide functions with respect to the PLB. Clearly, if the PLB has more than $k$ inputs, it can implement some wide functions. For example, if LUTs F and G both have four inputs in PLB1 and PLB2 and the external signal line is present, then either PLB can implement any function of up to five variables or some wide function of up to nine variables. The Boolean matching problem for PLB is to determine, for a given wide function with respect to the PLB, whether the function can be implemented by the PLB.
Boolean Matching for PLB: Given a function $f$ that is wide with respect to a PLB architecture, determine if $f$ can be realized by a single PLB of that architecture.
In general, Boolean matching considers input negation and/or permutation, output inversion, bridging of inputs, and constant assignments to some inputs. For LUT-based PLBs, however, input negation, input permutation and bridging in one LUT, output inversion, and constant assignments to unused inputs do not affect the matching feasibility. Therefore, only input partitioning and input sharing among LUTs are relevant factors to our Boolean matching for PLB problem. For any implementation of a function $f$ on PLB1, existence of bridged inputs implies nondisjoint functional decomposition or bidecomposition of $f$ or a combination of them. For PLB2, existence of bridged inputs implies the two cofactors share common variables.
B. Existence Conditions
We now briefly review the existence conditions for various forms of functional decomposition. Ashenhurst [2] gave the existence condition for simple disjoint decomposition. For nondisjoint bidecompositions of a function $f$, it was shown in [29] that they can be obtained by applying disjoint bidecomposition to the cofactors of $f$. In fact, a few forms of nondisjoint decomposition that correspond to different PLB input bridging patterns can be obtained in a similar way. Because of this, we defer the existence conditions to the next section, where we introduce the Boolean matching approach for each PLB input bridging pattern.
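As an illustration of Ashenhurst's classical condition, a simple disjoint decomposition $f = g(h(B), A)$ under a bound set $B$ (with free set $A$) exists if and only if the decomposition chart of $f$ has at most two distinct columns. The Python sketch below checks this by brute force on a truth table. It is only a small-scale illustration with assumed function names; it is not the OBDD-based implementation used in this paper.

```python
from itertools import product

def column_multiplicity(f, n, bound):
    """Count distinct columns in the decomposition chart of f.

    f     : callable taking n bits (0/1) and returning 0/1
    bound : list of variable indices forming the bound set
    Columns are indexed by bound-set assignments, rows by free-set assignments.
    """
    free = [i for i in range(n) if i not in bound]
    columns = set()
    for b_vals in product((0, 1), repeat=len(bound)):
        col = []
        for a_vals in product((0, 1), repeat=len(free)):
            x = [0] * n
            for i, v in zip(bound, b_vals):
                x[i] = v
            for i, v in zip(free, a_vals):
                x[i] = v
            col.append(f(*x))
        columns.add(tuple(col))
    return len(columns)

def has_simple_disjoint_decomposition(f, n, bound):
    # Ashenhurst: decomposable under `bound` iff at most two distinct columns.
    return column_multiplicity(f, n, bound) <= 2

# Assumed examples: f1 decomposes as g(a XOR b, c, d); f2 does not under {a, b}.
f1 = lambda a, b, c, d: ((a ^ b) & c) | d
f2 = lambda a, b, c, d: (a & c) | (b & d)
```

Here `has_simple_disjoint_decomposition(f1, 4, [0, 1])` holds because the columns for `ab = 01` and `ab = 10` coincide, as do those for `00` and `11`, whereas `f2` yields four distinct columns under the same bound set.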
A few approaches for partially dependent decomposition and nondisjoint decomposition were proposed in the past few years [13], [20], [25], [26], [28], with different perspectives and algorithmic characteristics. In particular, an existence condition for both partially dependent decomposition and nondisjoint decomposition was given in [13]. The condition can be used to compute one partially dependent encoding function or nondisjoint variable efficiently. For finding multiple partially dependent encoding functions, the authors in [13] took an approach similar to that in [36] and [37].
IV. BOOLEAN MATCHING APPROACHES FOR PLB1
In this section, we consider matching wide functions to PLB1 in Configurations A, B, C, and D shown in Fig. 3 obtained from different bridging status of input . In Configuration A, does not feed to LUT ; in Configuration B, feeds to LUT only; in Configuration C, feeds to all three LUTs, while in Configuration D, feeds to LUTs and . For each configuration, there might be bridged inputs to LUTs F and G, shown as dashed lines in Fig. 3. There can be multiple bridged inputs to F and G as long as the total number of distinct inputs matches that of the wide function. For example, up to two bridged inputs to F and G are allowed when we consider matching a six-variable function to an XC4K CLB in Configuration D. It should be clear that the four configurations exhaust all possible ways of using PLB1. Hence, the Boolean matching problem for PLB1 can be solved by matching functions to the four configurations individually.
To achieve efficient computation, functional decomposition operations are performed using ordered binary decision diagrams [5] in our implementation, as in approaches such as [6], [25], and [26].
A. PLB1 in Configuration A
B. PLB1 in Configuration B
In Case 2, a simple disjoint decomposition of $f$ exists, which does not match directly to Configuration B. However, it is possible to replace a large encoding function with two small encoding functions. In particular, we proved that a function can be represented in the required form when the following condition holds. Note that Cases 1 and 2 cover all possible implementations of $f$ on PLB1 in Configuration B. Therefore, using the partially dependent decomposition algorithm in [13] and Theorem 7, we can determine if a wide function can be implemented on PLB1 in Configuration B. Note that overlap between the inputs of LUTs F and G is not excluded in either case. Therefore, bridged inputs to LUTs F and G have been considered implicitly.
C. PLB1 in Configuration C
To test the matching of a function $f$ to Configuration C of PLB1, we enumerate the candidate input variables and compute bidecompositions for the two cofactors of $f$ such that the condition in Theorem 8 is satisfied. Because this input feeds to LUTs F, G, and H, the existence condition for Configuration C is less constrained compared to Configurations A and D (Section IV-D).
D. PLB1 in Configuration D
If PLB1 in Configuration D implements a function $f$, a corresponding decomposition of $f$ must exist. Other bridged inputs to LUTs F and G may exist. We prove the following result.
Both bidecompositions are then tested for the condition in Theorem 9. Note that we do not derive four bidecompositions, as in the matching to Configuration A, because it is and under comparison rather than and . A special case that requires particular attention is when one of the cofactors is a constant (zero or one). For example, assume that is a constant. Then, we can always obtain a decomposition of $f$ by duplicating for after obtaining the bidecomposition of followed by producing such that its value does not depend on the input (e.g., , where is the constant ). Therefore, the condition in Theorem 9 is always satisfied for the special case. When both cofactors are constant, we set both and to constant and construct and accordingly to obtain a decomposition.
If there is no bridged input between LUTs and (i.e., ), we may also use the following theorem for efficient matching to Configuration D. If: Since , the existence of simple disjoint decomposition of implies the existence of a simple disjoint decomposition of under the bound set , which in turn implies the existence of decomposition form for . This proves the sufficient condition.
Using Theorem 10 is more efficient than using Theorem 9 for matching to Configuration D when there are no bridged inputs between LUTs F and G. We first compute a simple disjoint decomposition and then identify a nondisjoint variable. If both are successful, we compute the whole decomposition. Since computing a simple disjoint decomposition and identifying a nondisjoint variable can be performed efficiently, we save runtime in this case.
V. BOOLEAN MATCHING FOR PLB2
Boolean matching for PLB2 is simpler than that for PLB1 (but PLB2 is not as powerful as PLB1 for implementing wide functions as shown in Section VI). It is easy to see that bridging to LUTs F or G does not help in matching wide functions to PLB2. Such a matching can be obtained by performing Shannon expansions.
Theorem 11: PLB2 can implement a function $f$ if and only if a Shannon expansion $f = \bar{x} f_{\bar{x}} + x f_{x}$ can be obtained for some input variable $x$ such that the supports of the two cofactors fit in LUTs F and G, respectively.
Therefore, given a wide function, we enumerate every input as the MUX selection signal and check if the supports of the two cofactors contain no more variables than the input sizes of LUTs F and G, respectively. Once the constraints are met, we obtain a matching for PLB2.
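The enumeration just described can be sketched in a few lines of Python. This is a brute-force truth-table illustration, not the implementation evaluated in this paper. The names `support`, `cofactor`, and `match_plb2`, and the parameters `k1` and `k2` (input sizes of LUTs F and G), are hypothetical. The sketch also assumes that either cofactor may be assigned to either LUT, which matters only when the two LUT sizes differ.

```python
from itertools import product

def support(f, n):
    """Indices of the variables that f (a callable on n bits) depends on."""
    dep = set()
    for x in product((0, 1), repeat=n):
        for i in range(n):
            y = list(x)
            y[i] ^= 1                      # flip one input
            if f(*x) != f(*y):
                dep.add(i)
    return dep

def cofactor(f, i, value):
    """Cofactor of f with variable i fixed to value (still takes n arguments)."""
    def g(*x):
        y = list(x)
        y[i] = value
        return f(*y)
    return g

def match_plb2(f, n, k1, k2):
    """Try each input as the MUX select; return (select, supp0, supp1) or None."""
    for s in range(n):
        s0 = support(cofactor(f, s, 0), n)
        s1 = support(cofactor(f, s, 1), n)
        # Assumption: the cofactors may be placed in F and G in either order.
        if (len(s0) <= k1 and len(s1) <= k2) or (len(s0) <= k2 and len(s1) <= k1):
            return s, s0, s1
    return None

# Assumed examples: a six-input MUX-structured function and six-input parity.
f_mux = lambda s, a, b, c, d, e: (a & b) if s else (c ^ d ^ e)
parity6 = lambda *x: sum(x) % 2
```

With two 4-LUTs, `f_mux` matches with input 0 as the selection signal, while six-input parity does not match because each of its cofactors depends on five variables.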
VI. BOOLEAN MATCHING-BASED ARCHITECTURE EVALUATION
We applied our Boolean matching approaches in two experiments. First, we employed them to map 1868 benchmark circuits provided by Xilinx, Inc. All circuits are known to be implementable in one XC4K CLB, but only up to 76% of them were mapped successfully with Xilinx internal tools or any other commercial FPGA tools [23]. Using Boolean matching techniques developed in this paper, we were able to achieve 100% single CLB implementation for this set of circuits. Second, we employed the matching approaches in the evaluation of four different PLB architectures (shown in Fig. 5) based on the percentage of wide functions (extracted from MCNC benchmarks) that can be implemented with each PLB. Our quantitative approach can help design better FPGA logic blocks.
A. Boolean Matching for XC4K CLB
Since the characteristic number of the XC4K CLB is five, wide functions are functions with six to nine variables for the XC4K CLB. We refine the XC4K CLB architecture into the configurations a to h shown in Fig. 4, with respect to the input sizes of six, seven, eight, and nine. For example, the configurations 6.b and 6.c are instances of Configurations A and C, respectively. Bridged inputs are explicitly shown in Fig. 4. There are as many as eight configurations for circuits with six inputs while there is only one configuration for circuits with nine inputs. Note that the configurations are not exclusive. For example, configurations 6.a, 6.b, 6.c, and 6.d are increasingly more capable in implementing six-input functions (but also require longer and longer runtime).
Refined configurations that involve multiple bridged inputs (e.g., 6.g) may combine some Configurations A, B, C, and D. Therefore, matching functions to configurations of multiple bridged inputs may require a sequence of functional decomposition operations described in Section IV. For example, configuration 6.d is a combination of Configurations A and C. In order to implement wide functions in configuration 6.d, we apply matching procedures for both configurations (that for Configuration C is based on Theorem 8) so that the function is decomposed in a way that matches configuration 6.d. Clearly, matching to configuration 6.g (which involves three bridged inputs) is the most time-consuming procedure, while matching to configurations without bridged inputs (6.a, 7.a, 8.a, 8.b, and 9.a) employs only simple disjoint decompositions. It is worth noting that configurations 6.h and 8.d are instances of Configuration B to which partially dependent decomposition is employed for a match. Since bridged inputs are taken into account implicitly in partially dependent decomposition, there is no need for computing cofactors using Shannon expansions with bridged inputs.
Initially, 20 out of 1868 Xilinx benchmark circuits had five or fewer inputs. After applying the SIS rugged script, we obtained an additional 292 circuits of five or fewer inputs. The remaining 1576 circuits consist of 393, 371, 423, and 389 circuits of six, seven, eight, and nine inputs, respectively. Each circuit was then matched to the refined configurations in alphabetical order (i.e., 6.a before 6.b before 6.c, etc.) so that implementation of the least bridged inputs could be obtained. The results are presented in Table I. Note that 30 circuits of six inputs are matched to configuration 6.g which involves three bridged inputs.
B. PLB Architecture Evaluation
In this section, we present our evaluation of four PLB architectures, which are variations in the PLB1 and PLB2 families. Their diagrams are shown in Fig. 5, where (a) is the XC4K CLB; (b) is a PLB2 that could have different sizes of LUT F while LUT G has four inputs; (c) has a degenerate LUT that is a wire connection; and (d) does not have the external line, and one of its LUTs has four inputs. We evaluate PLBs based on the number of wide functions (extracted from MCNC benchmarks) that each PLB can implement and the number of SRAM bits in LUTs. The bits in LUTs (for storing truth tables) are called LUT bits in the sequel.
Our approach is as follows. First, we compute for each node the complete set of seven-feasible cuts [16] (where each cut corresponds to the inputs to a supernode at the node). The number of cuts largely depends on the size of the circuit. For example, there are 285, 422, and 734 instances of five cuts, six cuts, and seven cuts in 5xp1, while there are 28 875, 65 245, and 157 028 instances of five cuts, six cuts, and seven cuts in des. Second, for each cut (i.e., supernode), we compute its function and match the function to PLBs. We say a cut can be implemented by a PLB if the corresponding function can be matched to the PLB. We report the percentage of successful implementation of cuts in Tables II and III. Finally, we divide the number of implemented cuts by the number of LUT bits for each PLB to represent PLB functional capability. In other words, we measure the efficiency of LUT bits in wide function implementation. Our measurement, of course, is only one aspect of PLB architectures. Other important factors such as required routing resources are not taken into account in the evaluation. Nevertheless, we think that it is important to see the capability of various PLBs in implementing wide functions.
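The per-bit measure described above is simple arithmetic, sketched below in Python. A $K$-LUT stores $2^K$ bits, so an XC4K-style block with two 4-LUTs and one 3-LUT has 16 + 16 + 8 = 40 LUT bits. The function names and the cut count of 1000 are placeholders chosen for illustration; they are not figures from the tables of this paper.

```python
def lut_bits(lut_sizes):
    """Total SRAM bits needed for the truth tables of the given LUT sizes."""
    return sum(2 ** k for k in lut_sizes)

def capability_per_lut_bit(implemented_cuts, lut_sizes):
    """Number of implemented cuts divided by the number of LUT bits."""
    return implemented_cuts / lut_bits(lut_sizes)

# Two 4-LUTs plus one 3-LUT (XC4K-style block): 16 + 16 + 8 = 40 bits.
xc4k_style_bits = lut_bits([4, 4, 3])

# Hypothetical cut count, used only to show the calculation.
example_ratio = capability_per_lut_bit(1000, [4, 4, 3])
```

A MUX contributes no LUT bits in this accounting, so a block with two 4-LUTs and a MUX would count 32 bits under the same assumption.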
We match six cuts to the XC4K CLB in Configurations A, B, C, and D, and report the average percentages of matched cuts in Table II. Additionally, we report the results on Configuration A while disallowing input bridging (A-br) to see its impact on wide function implementation. Comparing the results of Configurations A versus A-br as well as the results of Configurations C versus D, we see the percentages of matches increase substantially when the inputs to LUTs F, G, and H are allowed to bridge with each other. Also, we notice that each of Configurations A, B, and C alone can implement over 90% of six cuts in the MCNC benchmarks. Overall, 99% of six cuts (all) can be implemented using the XC4K CLB.
For the implementation of seven cuts on XC4K CLBs, it is interesting to see that Configuration B (based on partially dependent decomposition) is more capable than other configurations (based on bidecompositions). Overall, 92% of seven cuts can be implemented using the XC4K CLB.
In the evaluation of PLB2 architectures, we consider three configurations. In other words, we fixed LUT G as a 4-LUT and considered LUT F as a 3-LUT, 4-LUT, and 5-LUT, respectively. Note that it is not guaranteed that a single PLB2 with a 3-LUT F can realize an arbitrary five-input function. However, experimental results in Table III show that it implements 98% of five cuts. Experiments also show that the configurations with a 3-LUT F and a 4-LUT F implement only 5% and 8% of six cuts, respectively, while the configuration with a 5-LUT F implements 98% of 6-LUTs. It is worth noting that shrinking LUT F from 4-LUT to 3-LUT loses marginally in terms of functional capability but saves substantially on LUT bits (25%). Also, by expanding LUT F from 4-LUT to 5-LUT, PLB2 gains substantial capability on six-input function implementation with an additional 25% more LUT bits.
Architecture (c) has the fewest LUT bits among the four PLBs under evaluation. Although implementation of five cuts is not guaranteed, experiments show that 96% of five cuts can be implemented by applying simple disjoint decomposition (SD) on the five-input functions. If nondisjoint decomposition (ND) is also applied, an additional 2% of five cuts can be implemented. It also implements 85% of six cuts (using SD).
For , we considered two architectures and . Although implementation of five cuts is not guaranteed, we found that can implement most five cuts and of six cuts, and can implement 97% of five cuts and 89% of six cuts.
We compared these PLBs in terms of the unit-bit implementation capability (UBIC), which is defined as the number of cuts that a PLB can implement divided by the number of LUT bits in that PLB. The comparison is presented in Table IV . Among the four evaluated PLB architectures, has the highest UBIC for five cuts and six cuts. In addition, we notice that has high UBIC for five cuts and has 
VII. APPLICATION TO TECHNOLOGY MAPPING FOR FPGAS
In this section, we incorporate our Boolean matching techniques into technology mapping algorithms for depth minimization. Our mapping algorithms are targeted to the XC4K CLB and the XC5200 CLB architectures. However, they are applicable to general PLB1 and PLB2 types of FPGAs. We formulate the following problem.
Technology Mapping for XC4K/XC5200 Series FPGAs: Given a network $N$, compute a functionally equivalent PLB network of XC4K CLBs or XC5200 CLBs such that the depth of the PLB network is minimum.
Our approaches inherit the polynomial-time FlowMap algorithm [9] that can guarantee the minimum depth in LUT mapping solutions. Given a Boolean network of logic gates that have no more than $K$ fan-ins, FlowMap first computes the minimum level for each node $v$ in all LUT mapping solutions. This level is called the LUT label of $v$, denoted as $l_{\mathrm{LUT}}(v)$ in this paper. After computing the labels, FlowMap generates a mapping solution based on them.
Computing node labels is the key operation in FlowMap and is briefly reviewed as follows. Every PI has a minimum level of zero. Remaining node labels are computed from PIs to POs in a topological order. Let $N_v$ denote the subnetwork rooted at node $v$. A cut in $N_v$ is a set of nodes that separates $v$ from PIs. A cut is $K$-feasible if the node cutset contains at most $K$ nodes. The height of a cut is defined as the largest label for nodes in the cut. Let $p$ be the largest label among the fan-ins of $v$. It was shown in [9] that $l_{\mathrm{LUT}}(v) = p$ if there is a $K$-feasible cut of height $p - 1$ in $N_v$, which can be verified using the max-flow min-cut algorithm. If such a cut cannot be found, then $l_{\mathrm{LUT}}(v) = p + 1$. Our mapping algorithms take similar steps to compute the minimum depth in CLB networks.
A. Technology Mapping for XC4K FPGAs
In parallel to the definition of LUT label, the minimum level of node $v$ in any CLB network is called the CLB label of $v$, denoted $l_{\mathrm{CLB}}(v)$. In general, $l_{\mathrm{CLB}}(v)$ is no larger than the LUT label of $v$.
The largest CLB label in a network is called the CLB depth of the network. Our mapping algorithm for the XC4K CLB is called BM-Map. Before mapping, the input network is decomposed into a two-bounded network. BM-Map has three phases: initialization, labeling, and mapping. In initialization, BM-Map computes the set of all five-feasible cuts for every node. (Wide cuts are generated using five-feasible cuts for runtime consideration.) In the labeling phase, BM-Map computes both the LUT label and the CLB label of every node using this set. Nodes are processed in a topological order in both procedures. Finally, BM-Map produces an XC4K CLB network in the mapping phase.
The cutset is obtained using cut enumeration techniques in [16]. It computes the $K$-feasible cuts of a node $v$ by merging the cuts of the fan-ins of $v$ and rejecting those cuts that are not $K$-feasible. In theory, the number of $K$-feasible cuts grows exponentially with respect to $K$. However, for five-feasible cuts, this computation is efficient in practice. For most benchmarks, we see about 30-70 five-feasible cuts per node.
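The merging step can be illustrated with a short Python sketch. It follows the general idea of cut enumeration (combine one cut from each fan-in and discard combinations that are too large) but is not the procedure of [16] itself. The network representation (a dict from node name to fan-in list, listed in topological order), the inclusion of the trivial cut `{v}`, and the function name are assumptions made for this example.

```python
def enumerate_cuts(fanins, k):
    """Enumerate cuts with at most k nodes for every node.

    fanins : dict mapping node -> list of fan-in nodes (empty list for a PI);
             keys are assumed to be listed in topological order.
    Returns a dict mapping node -> set of frozensets (cuts).
    """
    cuts = {}
    for v, ins in fanins.items():
        if not ins:                        # primary input: only the trivial cut
            cuts[v] = {frozenset([v])}
            continue
        merged = {frozenset()}
        for u in ins:
            # Combine every partial cut with every cut of fan-in u,
            # rejecting unions that exceed k nodes.
            merged = {a | b for a in merged for b in cuts[u] if len(a | b) <= k}
        cuts[v] = merged | {frozenset([v])}  # keep the trivial cut for fan-outs
    return cuts

# Assumed toy network: x = gate(a, b), y = gate(x, c).
toy = {"a": [], "b": [], "c": [], "x": ["a", "b"], "y": ["x", "c"]}
cuts2 = enumerate_cuts(toy, 2)
cuts3 = enumerate_cuts(toy, 3)
```

With `k = 2` the cut `{a, b, c}` of `y` is rejected and only `{x, c}` (plus the trivial cut) remains, whereas with `k = 3` both are kept.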
Let $p$ be the largest LUT label among the fan-ins of a node $v$. Instead of using max-flow min-cut procedures, BM-Map determines if the LUT label of $v$ equals $p$ by looking for a cut of height $p - 1$ in the cutset of $v$. To compute the CLB label $l_{\mathrm{CLB}}(v)$, let $p$ now be the largest CLB label among the fan-ins of $v$. BM-Map performs the following two checks.
(C1) If there exists a $K$-feasible cut of height $p - 1$ in the cutset of $v$ (with $K = 5$ for the XC4K CLB), then $l_{\mathrm{CLB}}(v) = p$.
(C2) Otherwise, if there exists a nine-feasible cut of height $p - 1$ of which the corresponding wide function can be successfully matched to the XC4K CLB, then $l_{\mathrm{CLB}}(v) = p$.
If both checks fail, then $l_{\mathrm{CLB}}(v) = p + 1$.
We take two actions to save runtime in label computation. First, we do not exhaust all XC4K configurations in Boolean matching. We only test configurations (in Fig. 4) 6.a and 6.c for six cuts, 7.c and 7.f for seven cuts, 8.c for eight cuts, and 9.a for nine cuts because they have high matching percentages (Table II) with relatively low decomposition complexity. Second, to obtain wide cuts (of six to nine nodes) at each node $v$, we simply merge the cuts in the cutsets of the two fan-ins of $v$. By doing so, we trade the completeness of nine-feasible cuts for runtime.
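The two checks can be sketched as follows in Python. This is a simplified illustration rather than the BM-Map implementation. The function `clb_label`, its arguments, and the callback `matches_clb` (standing in for the Boolean matching test) are hypothetical, and the sketch assumes that the labels of all nodes appearing in the candidate cuts have already been computed.

```python
def clb_label(v, fanins, labels, small_cuts, wide_cuts, matches_clb):
    """Return p or p + 1 for node v following checks (C1) and (C2).

    fanins      : dict node -> list of fan-in nodes
    labels      : dict node -> CLB label of already-processed nodes (PIs are 0)
    small_cuts  : iterable of frozensets, cuts small enough for one CLB
    wide_cuts   : iterable of frozensets, cuts of six to nine nodes
    matches_clb : callable (v, cut) -> bool, hypothetical Boolean matching test
    """
    p = max(labels[u] for u in fanins[v])

    def height(cut):
        return max(labels[u] for u in cut)

    # (C1) a small cut of height p - 1 (skip the trivial cut {v}).
    if any(height(c) <= p - 1 for c in small_cuts if v not in c):
        return p
    # (C2) a wide cut of height p - 1 whose function matches one CLB.
    for c in wide_cuts:
        if v not in c and height(c) <= p - 1 and matches_clb(v, c):
            return p
    return p + 1

# Assumed toy data: z has fan-ins x and y (both label 1); PIs a..f have label 0.
labels = {n: 0 for n in "abcdef"}
labels.update({"x": 1, "y": 1})
fanins = {"z": ["x", "y"]}
small = [frozenset({"x", "y"})]
wide = [frozenset("abcdef")]
label_if_match = clb_label("z", fanins, labels, small, wide, lambda v, c: True)
label_if_no_match = clb_label("z", fanins, labels, small, wide, lambda v, c: False)
```

In the toy data, the only small cut has height 1, so (C1) fails; the label of `z` then depends entirely on whether the six-input cut matches a single CLB.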
After every node in the network has been labeled, BM-Map generates a CLB mapping solution that respects labels. A PO node is a critical PO node if . For those critical POs and their fan-in networks, BM-Map covers them with CLBs to guarantee the CLB depth computed in the labeling phase. For the remaining noncritical POs and their fan-in networks, BM-Map covers them with LUTs to save area. At last, BM-Map packs LUTs into CLBs using an efficient procedure match 4k proposed in [15] . The BM-Map algorithm is outlined in Fig. 6 .
B. Technology Mapping for XC5200 FPGAs
While the XC4K CLB can implement a large number of six cuts and seven cuts, it was shown in previous sections that the XC5200 CLB implements only a very small percentage of them. However, we can exploit the XC5200 CLB architecture based on an interesting result observed in our experiments.
Let us first introduce two concepts. A wide function $f$ is fully implementable on an XC5200 CLB if each of the cofactors $f_{\bar{x}}$ and $f_x$ depends on at most four variables for some variable $x$ (Theorem 11), while $f$ is partially implementable if only one of the two cofactors satisfies the condition. In Table V, the columns titled "fit" and "-fit" show the number of cuts that are fully and partially implementable on XC5200 CLBs, respectively. Although very few six cuts and seven cuts are fully implementable, most of them are partially implementable. Based on this observation, we consider using XC5200 CLBs for partially implementable functions. [Note that most wide functions are fully implementable on XC4K CLBs (see Table II).]
Let denote the function of the subnetwork rooted at with inputs from [see Fig. 7(a)]. Let and assume that . Then, is partially implementable on XC5200 CLB [see Fig. 7(b)]. The dashed line represents possible bridged inputs to LUTs. Decomposition operations are performed iteratively in three steps.
1) Choose a two-variable bound set.
2) Perform a simple disjoint decomposition under the bound set.
3) If the decomposition is successful, create a node for [see Fig. 7(c)] and compute a min-cut of height in the cone rooted at . If the min-cut is feasible, then and, therefore, since all the inputs of , , and have a label .
Steps 1 to 3 are iterated for all possible bound sets until a success is found or bound sets are exhausted. In the latter case, we assign .
VIII. TECHNOLOGY MAPPING EXPERIMENTAL RESULTS
We conducted experiments on a Sun ULTRA2 workstation with 256 MB of memory. The tested circuits are MCNC benchmarks which were optimized using the SIS rugged script [32] and decomposed into two-input networks using the dmig algorithm [7], [35]. The mapping goal was to minimize the CLB depth with consideration to area minimization. In the first experiment, we applied FlowMap and BM-Map to map MCNC benchmarks into XC4K CLBs. After technology mapping, match 4k [15] was employed to pack LUTs into XC4K CLBs. We limited the maximal number of wide cuts to ten and 50 in BM-Map(10) and BM-Map(50) and reported corresponding results in Table VI. Compared to FlowMap, BM-Map obtained 14% and 18% smaller depth when ten and 50 wide cuts were tested, respectively. However, BM-Map uses substantially more CLBs compared to FlowMap. After careful examination, we found that a large percentage of 5-LUTs mapped by FlowMap were decomposed by match 4k into 2-LUTs and 4-LUTs and subsequently packed into CLBs with other 4-LUTs. This is very efficient for area minimization. BM-Map does not obtain this benefit because it uses LUTs only to cover noncritical portions of the input network and obtains far fewer 5-LUTs. In general, the more cuts that are tested for matching in BM-Map, the better the mapping results will be, but the longer the runtime. The ratio, however, is biased significantly by the circuit des.
In the second experiment, we applied our mapping approach to the technology mapping for XC5200 FPGAs. After mapping, LUTs were packed into XC5200 CLBs. One CLB is allocated for one 5-LUT as well as for one pair of 4-LUTs. We compared BMD-Map with the CutMap algorithm [12]. CutMap also inherits the FlowMap algorithm, but in addition performs simultaneous area minimization. It obtains the same LUT depth as obtained by FlowMap, but uses 18% fewer 5-LUTs on average for industrial benchmarks [22]. The mapping results are reported in Table VII.
IX. CONCLUSION
We have presented new Boolean matching methods for LUT-based PLBs and their applications to architecture evaluation and FPGA technology mapping. Our Boolean matching methods employ functional decomposition operations to represent functions in forms corresponding to the target PLB architecture. Existence conditions for new functional decomposition forms are given and proved. We applied the methods to the evaluation of PLB architectures in terms of logic implementation capability. Experimental results show that the Xilinx XC4K CLB can implement 98% and 88% of six- and seven-variable functions extracted from MCNC benchmarks, respectively, while a simplified PLB architecture implements the largest number of functions per LUT bit. We developed new technology mapping algorithms that employ the Boolean matching techniques for depth minimization. Compared to conventional LUT mapping approaches, experimental results show 18% and 12% depth reduction on average for the Xilinx XC4K series and XC5200 series FPGAs, respectively, with up to 15% area reduction in XC5200 FPGAs. Our Boolean matching techniques can be useful for designing future FPGA architectures and for better utilization of FPGA silicon resources.
