Abstract
Introduction
LUT-based FPGAs are already used to implement digital designs in a wide variety of application domains. However, the search for new FPGA architectures and efficient methods of their synthesis remains a continuing area of research. This is motivated by considerations such as reducing delay, area, and power consumption, improving routability and resource utilization, and increasing the expressive power of programmable logic. Reducing delay is especially important as it allows high-frequency FPGAs to compete with ASICs.
Delay in modern FPGAs is dominated by interconnect. A typical ratio between the intrinsic delay of a LUT and a wire delay is 1:5, because most of the wires are routed through multiple switch-boxes and routing channels.
One way to reduce routing delay is to use an FPGA architecture that has direct, or "non-routable", wires. Such wires can connect two adjacent LUTs in a programmable fabric, possibly at the expense of restricting the fanout count of the LUTs involved. In this paper, we investigate several ways direct wires can be used and experimentally show improvements in delay as a consequence.
We call a group of LUTs connected by direct wires a LUT structure. It can be characterized by the number of interconnected LUTs, their sizes, and their connectivity. We investigate two single-output LUT structures, "44" and "444", shown in Figure 1. The connections between the LUTs inside each structure are direct (non-routable), while all other connections are routable. To investigate the viability of these structures, we developed a new mapping algorithm aimed at efficient implementation into such structures. The contributions of this paper are the following:
• Development of an efficient matching algorithm to check if a given Boolean function can be implemented using a given LUT structure.
• Modification to the priority-cut-based technology mapper [16] to perform mapping into the LUT structures, as opposed to single LUTs.
• Experimental evaluation of the new mapping algorithm applied to regular LUTs as well as the 44 and 444 structures. A promising conclusion is that the algorithm can improve both delay and area for traditional mapping and substantially improve delay when LUT structures are used.
• Statistics on the relative number of k-input Boolean functions, appearing in industrial designs, that can be implemented using the 44 and 444 LUT structures.
The paper is organized as follows. Section 2 describes the background and the relevant decomposition theory and Section 3 describes the mapping algorithm. Section 4 reports on some experimental results while Section 5 concludes the paper and outlines future work.
Background

Boolean network
A Boolean network (or netlist, or circuit) is a directed acyclic graph (DAG) with nodes corresponding to logic gates and edges corresponding to wires connecting the gates. In this paper, we consider only combinational Boolean networks. A combinational And-Inverter Graph (AIG) is a Boolean network composed of two-input ANDs and inverters. The size (area) of an AIG is the number of its nodes; the depth (delay) is the number of nodes on the longest path from the primary inputs (PIs) to the primary outputs (POs). A cut C of a node n is a set of nodes of the network, called leaves of the cut, such that each path from a PI to n passes through at least one leaf. Node n is called the root of cut C. For details on cuts and their roles in technology mapping, see [8] [16]. We have modified an existing cut-based mapper for mapping into LUT structures.
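The definitions of a cut and of AIG depth can be made concrete with a minimal sketch. The three-node AIG and the helper names `is_cut` and `depth` below are illustrative assumptions, not part of the mapper described in this paper.

```python
from typing import Dict, Set, Tuple

# Hypothetical toy AIG: each AND node maps to its two fanins; PIs have no entry.
# Inverters are edge attributes and do not affect cut membership, so they are omitted.
TOY_AIG: Dict[str, Tuple[str, str]] = {
    "n1": ("a", "b"),
    "n2": ("b", "c"),
    "n3": ("n1", "n2"),
}

def is_cut(root: str, leaves: Set[str], aig: Dict[str, Tuple[str, str]]) -> bool:
    """Return True if every path from a PI to `root` passes through a leaf."""
    pending, visited = [root], set()
    while pending:
        node = pending.pop()
        if node in leaves or node in visited:
            continue              # this path is already blocked by a leaf
        visited.add(node)
        if node not in aig:       # reached a PI without crossing any leaf
            return False
        pending.extend(aig[node])
    return True

def depth(node: str, aig: Dict[str, Tuple[str, str]]) -> int:
    """Number of AND nodes on the longest path from a PI to `node`."""
    if node not in aig:
        return 0
    return 1 + max(depth(fanin, aig) for fanin in aig[node])
```

In this toy network, {n1, n2} and {a, b, c} are cuts of n3, whereas {n1, b} is not, because the path from PI c through n2 reaches n3 without crossing a leaf.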
Decomposition theory
This section reviews the decomposition theory used in Section 3 of this paper. Refer to [12] [13] [19] [20] for details on the traditional decomposition algorithms.
For a Boolean function $f(X)$ and a subset of its support, $X_1$, the set of distinct cofactors, $q_1(X), q_2(X), \ldots, q_\mu(X)$, of $f$ with respect to (w.r.t.) $X_1$ is derived by substituting all assignments of $X_1$ into $f(X)$ and eliminating duplicated functions. The number of distinct cofactors, $\mu$, is the column multiplicity of $f$ with respect to $X_1$.
Given a partition of $X$ into two disjoint subsets, $X_1$ and $X_2$, a decomposition of $f(X)$ has the form
$$f(X) = h(g_1(X_1), g_2(X_1), \ldots, g_k(X_1), X_2).$$
Subsets $X_1$ and $X_2$ are called the bound set and the free set, respectively. Functions $g_1, \ldots, g_k$ depend only on the bound set, and $h$ combines their outputs with the free-set variables.
The decomposition of $f(X)$ with $k$ functions $g_1(X_1), g_2(X_1), \ldots, g_k(X_1)$ exists if and only if $\lceil \log_2 \mu \rceil \le k \le n$, where $\mu$ is the column multiplicity of $f$ with respect to $X_1$, and $n = |X_1|$.
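As an illustration of column multiplicity and the resulting lower bound on k, the following sketch computes both for a small function. It assumes a truth table stored as a Python integer whose bit m holds the function value for minterm m, with variable i taken from bit i of m. The helper names are hypothetical and the code is written for clarity rather than speed.

```python
from typing import Callable, Sequence

def truth_table(fn: Callable[..., int], n: int) -> int:
    """Pack fn(x0, ..., x_{n-1}) into an integer; bit m is the value at minterm m."""
    return sum(int(bool(fn(*[(m >> i) & 1 for i in range(n)]))) << m
               for m in range(1 << n))

def column_multiplicity(tt: int, n: int, bound: Sequence[int]) -> int:
    """Number of distinct cofactors of tt w.r.t. the bound-set variables."""
    free = [v for v in range(n) if v not in bound]
    columns = set()
    for b in range(1 << len(bound)):
        # Minterm bits contributed by this assignment of the bound set.
        base = sum(((b >> k) & 1) << v for k, v in enumerate(bound))
        # The cofactor, listed over all assignments of the free variables.
        column = tuple(
            (tt >> (base | sum(((f >> k) & 1) << v for k, v in enumerate(free)))) & 1
            for f in range(1 << len(free))
        )
        columns.add(column)
    return len(columns)

def min_decomposition_functions(mu: int) -> int:
    """Exact ceil(log2(mu)) for mu >= 1, computed without floating point."""
    return (mu - 1).bit_length()

# Example: f = (a AND b) XOR c with a, b, c as variables 0, 1, 2.
f = truth_table(lambda a, b, c: (a & b) ^ c, 3)
mu_ab = column_multiplicity(f, 3, (0, 1))   # cofactors are c and NOT c
mu_ac = column_multiplicity(f, 3, (0, 2))   # cofactors are 0, 1, b, NOT b
```

For the bound set {a, b}, the multiplicity is 2, so a single function g_1 suffices; for {a, c} it is 4, so at least two functions are needed.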
Algorithm
We use a notation for the LUT structures, e.g. "XY" or "XYZ", where the last character represents the root node and the first one or two characters represent the nodes feeding into it. For example, "345" represents a structure where a node of size 3 and a node of size 4 feed directly into a node of size 5. This section describes an efficient algorithm to determine whether a Boolean function can be implemented in "XY" or "XYZ", 2 ≤ k ≤ 6, k ∈ {X, Y, Z}.
The node size does not exceed 6 because most FPGAs are based on LUTs of 6 inputs or fewer. The support size of a function is limited to 16, because the truth table representation works well for such functions. In general, if runtime is not an issue, functions up to 20 inputs can be handled using truth tables. For larger functions, BDDs or a mixed AIG/SAT representation is preferable.
A special case of checking whether a function can be implemented using a LUT structure arises when the support sizes of the function and the structure are equal. In the case of the LUT structure "XY", if the support size of the function is equal to X + Y - 1, the decomposition exists only when the function has a DSD with exactly X variables as a separate block. This check is handled in the pseudo-code below.
For example, 5-variable Boolean function F = MUX(c0, d0, MUX(c1, d2, d3)) can be matched with structure "33", because variables {c1, d2, d3} can be decomposed as a separate block, MUX(c1, d2, d3). On the other hand, this function cannot be matched with structure "24", because no two variables can be decomposed as a separate block.
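The MUX example can be checked by brute force. The sketch below assumes that a single-output block over a variable subset exists exactly when the column multiplicity w.r.t. that subset is at most 2, which follows from the existence condition with k = 1. The truth-table encoding (bit m of an integer holds the value at minterm m), the convention MUX(s, a, b) = a if s else b, and all helper names are assumptions made for illustration.

```python
from itertools import combinations
from typing import Callable, List, Sequence, Tuple

def truth_table(fn: Callable[..., int], n: int) -> int:
    """Pack fn(x0, ..., x_{n-1}) into an integer; bit m is the value at minterm m."""
    return sum(int(bool(fn(*[(m >> i) & 1 for i in range(n)]))) << m
               for m in range(1 << n))

def column_multiplicity(tt: int, n: int, bound: Sequence[int]) -> int:
    """Number of distinct cofactors of tt w.r.t. the bound-set variables."""
    free = [v for v in range(n) if v not in bound]
    columns = set()
    for b in range(1 << len(bound)):
        base = sum(((b >> k) & 1) << v for k, v in enumerate(bound))
        columns.add(tuple(
            (tt >> (base | sum(((f >> k) & 1) << v for k, v in enumerate(free)))) & 1
            for f in range(1 << len(free))
        ))
    return len(columns)

def mux(s: int, a: int, b: int) -> int:
    return a if s else b

# Variable order: c0 = 0, d0 = 1, c1 = 2, d2 = 3, d3 = 4.
F = truth_table(lambda c0, d0, c1, d2, d3: mux(c0, d0, mux(c1, d2, d3)), 5)

def separable_blocks(tt: int, n: int, size: int) -> List[Tuple[int, ...]]:
    """Variable subsets of the given size that can form a single-output block."""
    return [s for s in combinations(range(n), size)
            if column_multiplicity(tt, n, s) <= 2]

blocks_for_33 = separable_blocks(F, 5, 3)   # candidates for the first LUT of "33"
blocks_for_24 = separable_blocks(F, 5, 2)   # candidates for the first LUT of "24"
```

The 3-variable search finds the block {c1, d2, d3}, while the 2-variable search returns no candidates, in agreement with the text.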
Checking decomposition "XY"
Consider decomposition checking for the "XY" structure. The input to the algorithm is an n-input Boolean function and an ordered pair of numbers (X, Y) where 0 ≤ n ≤ 16 and 2 ≤ X, Y ≤ 6. The pseudo-code of the algorithm is shown in Figure 3.1 below.
[Figure 3.1: pseudo-code of varset performLutMatchingXY(function F, ...), where F is represented using a truth table.]
Support minimization removes variable b from the support, resulting in function G = abc, whose support size is 3.
If n ≤ max(X, Y), the function can be packed into one node, while the other node can be treated as a buffer. If n > X + Y - 1, decomposition does not exist because the LUT structure is too small.
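The support computation and the two size-based early exits can be sketched as follows. This is an illustrative sketch, not the code of Figure 3.1; the function names, the returned labels, and the truth-table encoding (bit m of an integer holds the value at minterm m) are assumptions.

```python
from typing import List

def support(tt: int, n: int) -> List[int]:
    """Variables the function depends on, i.e. whose two cofactors differ."""
    deps = []
    for v in range(n):
        if any(((tt >> m) & 1) != ((tt >> (m | (1 << v))) & 1)
               for m in range(1 << n) if not (m >> v) & 1):
            deps.append(v)
    return deps

def size_precheck(support_size: int, x: int, y: int) -> str:
    """Classify a function by support size before any decomposition is attempted."""
    if support_size <= max(x, y):
        return "fits in one LUT"          # the other LUT acts as a buffer
    if support_size > x + y - 1:
        return "too large"                # the structure has too few inputs
    return "needs decomposition check"

# Example: F(a, b, c) = a AND c does not depend on b (variables a, b, c = 0, 1, 2).
# Minterms 5 and 7 are the only ones with a = 1 and c = 1.
F = 0b10100000
F_support = support(F, 3)
```

Here `F_support` contains only variables 0 and 2, so a function that nominally has three inputs is handled as a two-input function.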
Procedure findOutputDecomposition() checks for a special case of decomposition, F = x ⊗ G, where ⊗ is a two-variable Boolean function, x is a support variable, and G is a remainder function. For this, single-variable cofactors of F are checked and the following three special cases are considered: a constant-0 cofactor (AND-decomposition), a constant-1 cofactor (OR-decomposition), and two cofactors that are complements of each other (XOR-decomposition). Note that a pair of cofactors cannot be equal, because after support minimization, F depends on all its variables.
If the simple decomposition exists for one variable x, the check is iteratively applied to the remainder function G and its remainders, if any, until the number of decomposed variables is equal to Y-1. The decomposed variables are returned by the procedure findOutputDecomposition().
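A single step of this check, for one candidate variable, might look like the sketch below. It is not the authors' findOutputDecomposition(); the name `classify_output_decomposition`, the returned labels, and the truth-table encoding (bit m of an integer holds the value at minterm m) are assumptions, and the iteration over remainder functions is omitted.

```python
from typing import Callable, Optional, Tuple

def truth_table(fn: Callable[..., int], n: int) -> int:
    """Pack fn(x0, ..., x_{n-1}) into an integer; bit m is the value at minterm m."""
    return sum(int(bool(fn(*[(m >> i) & 1 for i in range(n)]))) << m
               for m in range(1 << n))

def cofactors(tt: int, n: int, v: int) -> Tuple[tuple, tuple]:
    """Negative and positive cofactors w.r.t. variable v, as bit tuples."""
    rest = [m for m in range(1 << n) if not (m >> v) & 1]
    neg = tuple((tt >> m) & 1 for m in rest)
    pos = tuple((tt >> (m | (1 << v))) & 1 for m in rest)
    return neg, pos

def classify_output_decomposition(tt: int, n: int, v: int) -> Optional[str]:
    """Return 'AND', 'OR', 'XOR', or None for the decomposition F = x_v (op) G."""
    neg, pos = cofactors(tt, n, v)
    if not any(neg) or not any(pos):
        return "AND"      # one cofactor is constant 0
    if all(neg) or all(pos):
        return "OR"       # one cofactor is constant 1
    if all(a != b for a, b in zip(neg, pos)):
        return "XOR"      # the cofactors are complements of each other
    return None

and_case = classify_output_decomposition(truth_table(lambda a, b, c: a & (b | c), 3), 3, 0)
or_case = classify_output_decomposition(truth_table(lambda a, b, c: a | (b & c), 3), 3, 0)
xor_case = classify_output_decomposition(truth_table(lambda a, b, c: a ^ (b & c), 3), 3, 0)
mux_case = classify_output_decomposition(truth_table(lambda a, b, c: b if a else c, 3), 3, 0)
```

The select input of a multiplexer falls into none of the three cases, which is why such functions are passed on to the bound-set search.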
If the variable set returned is not enough to decompose the function, procedure findGoodBoundSet() attempts to find a good bound set. This procedure tries to reorder variables in the truth table while minimizing the column multiplicity with respect to the topmost X variables. Reordering of variables in a truth table is similar to that for BDDs, but is typically faster because the procedure swaps variables i and j directly, without going through a sequence of adjacent variable swaps, known as variable sifting for BDDs. The best column multiplicity and variable set are returned.
If the column multiplicity is 2 and the number of variables in the set is sufficiently large (n ≤ |V| + Y - 1), the variable set is returned. Otherwise, a non-disjoint decomposition with one shared variable is attempted [12]. If such a decomposition exists, the variable set is returned.
If decomposition is not found, the variable order is reversed and findGoodBoundSet() is called again. Reversing the variable order is a heuristic used for hard-to-match functions, even though it does not guarantee that a match is always found if it exists. The reversing of the variable order can be done efficiently on a truth table.
An additional optimization, omitted in the pseudo-code of Figure 3.1, is caching of previously computed results. Thus, when decomposition checking is applied repeatedly to the same function, the result is retrieved from the cache instead of running the check from scratch.
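A direct swap of two arbitrary variables in a truth table can be sketched as below. The encoding (bit m of an integer holds the value at minterm m) and the function name are assumptions. A production implementation would operate on machine words with bit masks rather than one minterm at a time; this version only illustrates that no chain of adjacent swaps is needed.

```python
def swap_vars(tt: int, n: int, i: int, j: int) -> int:
    """Return the truth table of f with variables i and j exchanged."""
    mask = (1 << i) | (1 << j)
    out = 0
    for m in range(1 << n):
        bit_i, bit_j = (m >> i) & 1, (m >> j) & 1
        # Minterms where the two variables differ trade places; others stay put.
        src = m if bit_i == bit_j else m ^ mask
        out |= ((tt >> src) & 1) << m
    return out

# Example with two variables a = 0, b = 1: f = a AND NOT b is true only at minterm 1.
f = 0b0010
g = swap_vars(f, 2, 0, 1)        # NOT a AND b, true only at minterm 2
```

Swapping the same pair twice restores the original table, which gives a cheap sanity check for any faster implementation.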
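One simple way to realize such a cache is to memoize on the truth table together with the structure parameters, as in the sketch below. The function name is hypothetical and its body is only a placeholder size test standing in for the full decomposition check.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def check_xy_cached(tt: int, n: int, x: int, y: int) -> bool:
    """Memoized stand-in for an expensive "XY" decomposition check.

    The body is a placeholder (a necessary size condition only); a real check
    would run the bound-set search on the truth table `tt`.
    """
    return n <= x + y - 1
```

Because Python integers are hashable, the truth table itself can serve as part of the cache key, and `check_xy_cached.cache_info()` reports how many calls were answered from the cache.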
Checking decomposition "XYZ"
The input to the algorithm is an ordered set of integers (X, Y, Z) and an n-input Boolean function F, 0 ≤ n ≤ X + Y + Z - 2, where 2 ≤ X, Y, Z ≤ 6 and 0 ≤ n ≤ 16.
The decomposition check in this case is implemented as two checks. First, F is checked for decomposition using structure "XW", where W = Y + Z - 1, so that "XW" accommodates up to X + Y + Z - 2 inputs. If this decomposition exists, the remainder function G is checked for decomposition using structure "YZ".
To ensure that the resulting structure has blocks "X" and "Y" feeding directly into block "Z", instead of block "X" feeding into block "Y" and block "Y" feeding into block "Z", the algorithm in Figure 3.1 is modified slightly. When the feasibility checks are performed, the output variable of the first block "X" is not included in the variable sets returned by the checking procedures.
Modifying the technology mapper
The priority-cut-based technology mapper [16], which is used to map into the "XY" and "XYZ" LUT structures, allows the user to define custom cost functions for evaluating cuts during mapping. In the case of mapping into the "44" ("444") structures, the mapper performs enumeration of 7-input (10-input) cuts, computes their cut functions as truth tables, and checks the decomposition for each cut function. If the function is not decomposable, the cut is skipped. If it is decomposable, the area and delay of the resulting cut are defined using the number of inputs of the function, as specified in the given LUT library. A LUT library lists the delay and area of the LUT of each size. Examples of LUT libraries are given in Section 4.
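The role of the LUT library in costing a cut can be sketched as follows. The library contents below are illustrative assumptions (cost 1 up to four inputs, cost 2 for five to seven inputs), and `LutCost` and `cut_cost` are hypothetical names rather than part of the mapper's interface.

```python
from typing import Dict, NamedTuple, Optional

class LutCost(NamedTuple):
    area: float
    delay: float

# Assumed example library, indexed by the number of cut inputs.
EXAMPLE_LIBRARY: Dict[int, LutCost] = {k: LutCost(1, 1) for k in range(1, 5)}
EXAMPLE_LIBRARY.update({k: LutCost(2, 2) for k in range(5, 8)})

def cut_cost(num_inputs: int, decomposable: bool,
             library: Dict[int, LutCost]) -> Optional[LutCost]:
    """Area and delay of a cut, or None if the cut must be skipped."""
    if not decomposable or num_inputs not in library:
        return None
    return library[num_inputs]
```

With this library, a decomposable 7-input cut is charged the area and delay of two LUT levels, while a 7-input cut whose function fails the decomposition check is dropped from consideration.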
The mapper minimizes the delay of the mapping, followed by several rounds of heuristic area recovery. When the user-specified cost functions are employed, as described above, all the cuts used in the mapping have Boolean functions decomposable into the "XY" or "XYZ" LUT structures.
Experimental results
The proposed algorithm is implemented in ABC [2] [4] as part of the priority-cut-based technology mapper [16] (command if). The following are the relevant switches that have been added to the technology mapper:
• -S 44 enables mapping into "44" structure,
• -S 444 enables mapping into "444" structure,
• -K <num> specifies the cut size.
Experiments were performed using a suite of public benchmarks and a suite of industrial benchmarks.
Improvements to traditional mapping
In the first experiment, reported in Table 4.2, the target structure is the traditional 4-LUT FPGA. In this case, when switch "-S" is used, the delay and area of the LUT structure "44" ("444") are the same as those of two (three) 4-LUTs. The following runs are performed and reported in the columns of Table 4.2. The notation (cmd1; cmd2) n means that cmd1 followed by cmd2 was iterated n times.
• Baseline: This runs if with LUT Library L1 shown in 4 with Library L2. This library specifies delay/area equal to 1 for each LUT up to size 4, and equal to 2 for larger LUTs.
• 444: This runs (dch; if -S 444) 4 using Library L4. This library specifies delay/area equal to 1 for each LUT up to size 4, and delay/area equal to 2 or 3 for larger cuts.
• Best 444: This runs (dch; if -S 444) 4 using Library L5. This library estimates the quality of mapping into the LUT structure "444". For each cut size, the library contains the approximate number of LUTs needed to implement it, assuming that 50% of 7-input functions are mapped into two 4-LUTs and 50% are mapped into three 4-LUTs.
Table 4.2 shows that, when the proposed algorithm is used, both area and delay are improved, compared to mapping with structural choices (MSC). For example, when mapping is performed using 7-input cuts matched with the "44" LUT structures composed of two 4-LUTs, the area and delay are improved by 4.6% and 6.1%, respectively. When mapping using 10-input cuts matched with the "444" LUT structures composed of three 4-LUTs, the area and delay are improved by 7.4% and 11.3%, respectively.
Delay-optimization using LUT structures
In this experiment, reported in Table 4.3 and Table 4.4, we assume that FPGA hardware allows for efficient realization of the "44" and "444" structures with direct (non-routable) connections between the adjacent LUTs.
The runs performed are the same as in the previous experiment (Section 4.1), except that the delay of the direct connection is assumed to be 1.2 instead of 2. The updated LUT libraries are listed below: 44 (Library L3), 444 (Library L6), Best 444 (Library L7). The results of this experiment show that the delay can be substantially reduced at the cost of some increase in area, compared to mapping with structural choices (MSC).
Consider Table 4.3 as an example. When mapping is performed using 7-input cuts matched with the "44" LUT structures, the delay is reduced by 28.2% while the area is increased by 5.1%. When mapping is performed using the "444" LUT structures, the delay is reduced by 43.2% while the area is increased by 14.1%. The area increase may be prohibitive and will be addressed as part of future work.
The ratios of realizable functions
In a separate experiment not shown in the tables, we evaluated the relative number of 7-input (10-input) functions appearing in the circuits that can be matched with the "44" ("444") LUT structures. The results differ for the industrial designs and for the public benchmarks listed in Tables 4.2 and 4.3. For the industrial designs, 97% (70%) of 7-input (10-input) functions can be matched with the "44" ("444") LUT structure. For the public benchmarks, the numbers are 99% and 84%, respectively. It is surprising that such a high percentage of cuts have Boolean functions that can be matched with the LUT structures.
Conclusions
This paper proposes an improvement to technology mapping for FPGAs. The main idea is to reduce structural bias by mapping into LUT structures composed of two or three 4-LUTs. The experimental results indicate that the new algorithm can improve both area and delay of the traditional technology mapping.
Additionally, an FPGA architecture allowing for direct connections between pairs of adjacent LUTs is evaluated. When the algorithm is used to map into this architecture, the delay improvement can be up to 40% at the cost of some area increase.
Future work will explore whether the delay reductions reported in this paper translate into improved maximum clock frequency (Fmax) after place-and-route. 
