FPGA logic synthesis and technology mapping have been studied extensively over the past 15 years. However, progress within the last few years has slowed considerably (with some notable exceptions). It seems natural to then question whether the current logic synthesis and technology mapping algorithms for FPGA designs are producing near-optimal solutions. Although there are many empirical studies that compare different FPGA synthesis/mapping algorithms, little is known about how far these algorithms are from the optimal (recall that both logic optimization and technology mapping problems are NP-hard if we consider area optimization in addition to delay/depth optimization). In this paper we present a novel method for constructing arbitrarily large circuits that have known optimal solutions after technology mapping. Using these circuits and their derivatives (called LEKO and LEKU, respectively), we show that although leading FPGA technology mapping algorithms can produce close to optimal solutions, the results from the entire logic synthesis flow (logic optimization + mapping) are far from optimal. The best industrial and academic FPGA synthesis flows are around 70 times larger in terms of area on average, and in some cases as much as 500 times larger on LEKU examples. These results clearly indicate that there is much room for further research and improvement in FPGA synthesis.
INTRODUCTION
Field programmable gate arrays (FPGAs) have been gaining momentum as an alternative to application-specific integrated circuits (ASICs). FPGAs consist of programmable logic, I/O, and routing elements which can be programmed and reprogrammed in the field to customize an FPGA, enabling it to implement a given application in a matter of seconds or milliseconds. The most common type of programmable logic element used in an FPGA is called a K-LUT, which is a K-input one-output lookup table (LUT), capable of implementing any K-input one-output Boolean function.
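To make the K-LUT definition concrete, the following is a minimal sketch of a lookup table modeled as a truth table with 2^K entries. The class name `LUT` and its methods are hypothetical names chosen for illustration and do not come from any FPGA vendor tool.

```python
class LUT:
    """A K-input, one-output lookup table stored as a 2**K-entry truth table."""

    def __init__(self, k, table):
        if len(table) != 1 << k:
            raise ValueError("a K-LUT needs exactly 2**K table entries")
        self.k = k
        self.table = list(table)

    @classmethod
    def from_function(cls, k, fn):
        # Row i of the table holds fn applied to the bits of i (input 0 = LSB).
        rows = [fn(*[(i >> j) & 1 for j in range(k)]) for i in range(1 << k)]
        return cls(k, rows)

    def evaluate(self, *inputs):
        if len(inputs) != self.k:
            raise ValueError("expected %d inputs" % self.k)
        # The input bits form the address of the table row to read.
        index = sum(bit << j for j, bit in enumerate(inputs))
        return self.table[index]


# Any 4-input Boolean function fits in one 4-LUT; only the table contents differ.
and4 = LUT.from_function(4, lambda a, b, c, d: a & b & c & d)
xor4 = LUT.from_function(4, lambda a, b, c, d: a ^ b ^ c ^ d)
```

Because the table is fully programmable, the cost of a LUT in this model depends only on K and not on the function it stores, which is why area is counted in LUTs.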
Given an RTL design, the typical FPGA synthesis process consists of RTL elaboration, logic synthesis, and the physical design (layout synthesis) [10]. In this paper we will focus on logic synthesis, which can be broken down into two main steps: logic optimization and technology mapping. Logic optimization transforms the current gate-level network into an equivalent gate-level network more suitable for technology mapping. Technology mapping transforms the gate-level network into a network of programmable cells (in our case these cells are LUTs) by covering the network with these cells. Several algorithms perform logic optimization during technology mapping. As an example, Figure 1 shows the difference between mapping algorithms that use logic optimization and those that do not. By examining the logic function of f, we can see it just takes the logical AND of all of its inputs; thus, by manipulating the circuit we can reduce the mapping solution by one 4-LUT. Since the size of the circuit will be directly proportional to the price of an FPGA that can implement it, the logic synthesis step will play an integral role in the design flow.
As the FPGA technology gained popularity throughout the 1990s, a large amount of work was published that dealt with logic synthesis and/or technology mapping of FPGAs, including Chortle-crf [19], MIS-pga [25], XMap [22], VisMap [30], TechMap [27], Praetor [13], DAG-map [8], EdgeMap [31], FlowMap [12], Zmap [15], Cutmap [11], BoolMap [23] and FlowSYN [9].
These mapping algorithms employ many different techniques to achieve their solutions, including dynamic programming, bin packing, BDD-based logic simplification, and cut enumeration, just to name a few. Some of these algorithms focused on delay minimization [19] [9]. All of these algorithms were developed over a ten-year period in the 1990s, but after this influx the amount of new published work began to decrease steadily, with only a few novel algorithms emerging in the past few years, such as IMAP [3], Hermes [18], DAOmap [7] and the ABC mapper [1] [32]. To many people, this signaled that FPGA synthesis algorithms had probably hit a plateau. It is natural to then question whether the current logic synthesis and technology mapping algorithms for FPGA designs are producing near-optimal solutions. However, although there are many empirical studies that compare different FPGA synthesis/mapping algorithms, little is known about how far these algorithms are from the optimal (recall that both logic optimization and technology mapping problems are NP-hard if we consider area optimization in addition to delay/depth optimization).
In fact, a similar question was raised a few years ago when placement research slowed down. However, using a set of cleverly constructed examples, called PEKO (Placement Examples with Known Optimal) examples, the study in [6] showed the surprising results that wirelengths produced by state-of-the-art placement tools at that time were 1.66 to 2.53 times the optimal solutions in the worst cases. These results generated a renewed interest in placement research; within two to three years a large body of papers was published on placement optimality studies (e.g., [17] [4] ). Within three years, the optimality gap on the PEKO examples was reduced to roughly 20% on average [4] .
Unfortunately, there is no simple way to extend the ideas of testing placement optimality to logic synthesis because of the inherent differences in the two problems. Therefore, little progress has been made on testing the optimality of logic synthesis algorithms. The research in [24] presented a method which could only create very small structureless test cases, and they were used to test very primitive mappers. Another method, described in [2] , used a SAT solver as an exact logic synthesis tool for LUT-based FPGAs to see how much more the circuit area could be reduced by postprocessing the mapping solutions produced by existing mappers. But the results suggested that current mappers could not be easily improved. This is largely due to the highly localized search algorithm used in this approach (SAT-based optimal logic optimization is applied to logic cones of up to ten inputs).
In this paper we present a novel method for constructing arbitrarily large circuits that have known optimal solutions after technology mapping or known upper-bound solutions after logic optimization and technology mapping for LUT-based FPGAs. Using these circuits (called LEKO and LEKU), we show that although leading FPGA technology mapping algorithms can produce close to optimal solutions, the results from the entire logic synthesis flow (logic optimization + mapping) are far from optimal. The best industrial and academic FPGA synthesis flows are around 70 times larger in terms of area on average, and in some cases as much as 500 times larger on LEKU examples. These results clearly indicate that there is much room for further research and improvement in FPGA synthesis.
CONSTRUCTION OF BENCHMARKS
Construction of LEKO Examples
We present an algorithm for constructing a network Gn (with n inputs and outputs) of an arbitrarily large size that has a known optimal technology mapping solution. Gn is constructed in a special way by replicating a small circuit with a known optimal mapping solution into a circuit of any size that also has a known optimal mapping solution. These circuits are called LEKO (Logic synthesis Examples with Known Optimal) examples. The building block of our construction is a "hard graph" named G5, with the following properties:
1. It has five inputs and five outputs.
2. Every output is a function of all five inputs.
3. Each internal node of G5 has exactly two inputs.
4. There exists an optimal (in terms of area/depth) mapping of G5 into a 4-LUT mapping solution, denoted M5, such that M5 only has 4-LUTs (no 3-LUTs or 2-LUTs). For the G5 shown in Figure 2 , M5 has exactly seven 4-LUTs.
The specific G5 we used to construct our LEKO and LEKU benchmarks can be seen in Figure 2 , and the optimality of its technology mapping solution is stated in Theorem 1 below and verified in its proof.
Theorem 1:
G5 has an area-optimal technology mapping solution of seven 4-LUTs.
Proof
This is proved using the binate cover technique, which is able to compute the minimum-area technology mapping solution. In particular, we use the binate covering solver in SIS [28] using the command "xl_cover -h 0." The binate solver in our case returned a seven 4-LUT solution. This method cannot be used to prove the optimality of the larger LEKO circuits because it is computationally infeasible for the tool to return an optimal binate covering solution for any graph with more than 100 logic gates. ■
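To illustrate what a minimum-area mapping without logic optimization means, the following is a minimal sketch that enumerates K-feasible cuts and then searches exhaustively for the smallest set of LUTs covering a small network. It is a brute-force stand-in under stated assumptions and not the binate covering solver of SIS. The function names and the seven-input AND network are hypothetical, and every gate is assumed to have at most K fanins.

```python
from itertools import product


def enumerate_cuts(fanins, k):
    """Map each node to its set of cuts with at most k leaves.

    `fanins` maps a gate name to a tuple of fanin names; any name that is
    not a key of `fanins` is treated as a primary input.
    """
    cuts = {}

    def cuts_of(node):
        if node in cuts:
            return cuts[node]
        result = {frozenset([node])}  # trivial cut: the node itself
        if node in fanins:
            # Merge one cut from each fanin and keep the k-feasible unions.
            for combo in product(*(cuts_of(f) for f in fanins[node])):
                leaves = frozenset().union(*combo)
                if len(leaves) <= k:
                    result.add(leaves)
        cuts[node] = result
        return result

    for gate in fanins:
        cuts_of(gate)
    return cuts


def min_area_mapping(fanins, outputs, k):
    """Exhaustively find the fewest k-LUTs that cover the network as given."""
    cuts = enumerate_cuts(fanins, k)
    best = len(fanins) + 1  # mapping every gate to its own LUT beats this bound

    def search(pending, roots):
        nonlocal best
        if len(roots) + len(pending) >= best:
            return  # every pending node still needs at least one LUT
        if not pending:
            best = len(roots)
            return
        node, rest = pending[0], pending[1:]
        for cut in cuts[node]:
            if cut == frozenset([node]):
                continue  # the trivial cut does not implement the node
            # Gate leaves of the chosen cut must themselves become LUT roots.
            new = [m for m in cut
                   if m in fanins and m not in roots and m not in rest]
            search(rest + new, roots | {node})

    search([o for o in dict.fromkeys(outputs) if o in fanins], frozenset())
    return best


# Hypothetical network: a seven-input AND built from two-input gates.
and7 = {
    "a": ("x1", "x2"), "b": ("x3", "x4"), "c": ("x5", "x6"),
    "d": ("a", "b"), "e": ("c", "x7"), "f": ("d", "e"),
}
```

For this network, `min_area_mapping(and7, ["f"], 4)` returns 2. With K = 3 the structural optimum is 4 LUTs, although the same function fits in three 3-LUTs after restructuring, which is the kind of gap that logic optimization is meant to close. The search is exponential in the worst case and is only practical for very small networks.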
Figure 2. An example of G5
Using this newly created G5, a LEKO circuit G is created by stacking up G5s in such a way that from the outputs of G, there is only one way to traverse G5 to get to the inputs. The exact algorithm is presented in Figure 3, where createLEKO(L) creates a LEKO example with L·5^(L-1) G5s in L layers. In the algorithm, the U (union) operator does not disturb the order of the inputs or outputs. For example, when looking at A U B we can think of the inputs and outputs as an array of nodes. Then the index of every input node from A appears before any input node from B in A U B; likewise the property holds for the output nodes. The Copy operator creates a copy of the network renaming all the nodes; createEdge(x,y) just creates an edge from x to y; and G_output[i] is the i-th output of G.
... U copy(G5);
end-for
output G;
Figure 3. createLEKO algorithm
Logically, the createLEKO algorithm works as follows. It builds up the graph using layer upon layer of G5s in order to get a LEKO example of L layers. It first creates the bottom layer of 5^(L-1) G5s, then for each additional layer it makes 5^(L-1) copies of G5 and proceeds to connect the outputs of the graph G to the inputs of the newly created layer. It spreads out the connections in such a way that for an arbitrary G5 at the top layer, there exists a path to it from every G5 at the bottom layer (i.e., every G5 at the top level is connected to every G5 at the bottom level). Thus, using this algorithm and any number n, one can create a LEKO circuit having more than n nodes and a known optimal technology mapping solution whose optimality is proven in Theorem 2. By using this method, we were able to construct G25 (seen in Figure 4) by calling createLEKO with two layers. We similarly constructed G125 (Figure 5) with three layers, and we constructed G625 with four layers.
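The layering described above can be sketched at the level of whole G5 blocks. The wiring rule used here, a base-5 butterfly in which the connection between layers k and k+1 rewrites the k-th base-5 digit of the block index, is an assumption chosen because it satisfies the stated connectivity property. The exact edge order of createLEKO may differ, and the names `create_leko` and `paths_bottom_to_top` are hypothetical.

```python
def create_leko(num_layers):
    """Block-level sketch of a LEKO network (assumed butterfly wiring).

    Returns (width, edges). `width` is the number of G5 blocks per layer and
    each edge is ((layer, block, out_port), (layer + 1, block, in_port)).
    """
    width = 5 ** (num_layers - 1)
    edges = []
    for layer in range(num_layers - 1):
        stride = 5 ** layer
        for block in range(width):
            digit = (block // stride) % 5  # base-5 digit handled by this layer
            for out_port in range(5):
                # Output o goes to the block whose digit is o; it arrives on
                # the input port numbered by the source block's own digit.
                target = block + (out_port - digit) * stride
                edges.append(((layer, block, out_port),
                              (layer + 1, target, digit)))
    return width, edges


def paths_bottom_to_top(num_layers):
    """Count block-level paths from each bottom G5 to each top G5."""
    width, edges = create_leko(num_layers)
    # reach[b][s] = number of paths from bottom block s to block b
    reach = [[int(b == s) for s in range(width)] for b in range(width)]
    for layer in range(num_layers - 1):
        nxt = [[0] * width for _ in range(width)]
        for (src_layer, block, _), (_, target, _) in edges:
            if src_layer == layer:
                for s in range(width):
                    nxt[target][s] += reach[block][s]
        reach = nxt
    return reach
```

Under this wiring, every entry of `paths_bottom_to_top(L)` equals 1, so every top-layer G5 is reached from every bottom-layer G5 by exactly one path of G5s. With two layers the sketch has 5 blocks per layer and 25 primary inputs, matching the size of G25.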
Theorem 2:
The optimal mapping solution of an arbitrarily sized LEKO circuit without logic optimization is achieved when every G5 in the circuit is mapped optimally without overlapping any other G5. Proof: See Appendix. 
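Taken together, Theorem 1 (seven 4-LUTs per G5), Theorem 2 (each G5 mapped separately) and the block count L·5^(L-1) imply a closed form for the optimal area of a LEKO circuit. The short sketch below evaluates it; the function names are hypothetical.

```python
LUTS_PER_G5 = 7  # size of M5, the optimal 4-LUT mapping of one G5 (Theorem 1)


def leko_num_inputs(num_layers):
    # Each of the 5**(L-1) bottom-layer G5s contributes five primary inputs.
    return 5 ** num_layers


def leko_optimal_luts(num_layers):
    # L layers of 5**(L-1) blocks, each mapped separately with seven 4-LUTs.
    return LUTS_PER_G5 * num_layers * 5 ** (num_layers - 1)


for layers in (2, 3, 4):
    print("G%d: %d LUTs" % (leko_num_inputs(layers), leko_optimal_luts(layers)))
```

For two, three and four layers this gives 70, 525 and 3500 4-LUTs for G25, G125 and G625, respectively.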
Construction of LEKU Examples
A LEKU (Logic synthesis Examples with Known Upper-bounds) example LEKU(G) is derived from the LEKO example G after collapsing and gate decomposition of G. Clearly, the optimal optimization + technology mapping solution of G provides an upper bound on the area of the corresponding LEKU(G) example due to the functional equivalence of G and LEKU(G). In this paper we focus on constructing LEKU(G25), which may already result in over 1 million gates after collapsing and decomposition. It is not reasonable to require the existing FPGA synthesis tools to handle examples beyond this size.
In fact, after collapsing G25, we tried different decomposition algorithms. LEKU-CD(G25) was constructed by first collapsing LEKO(G25) into a two-level network, then decomposing the result into an equivalent two-bounded simple-gate network (using the SIS [28] commands collapse and tech_decomp), while LEKU-CB(G25) was constructed by first collapsing the network and then balancing it, using the collapse and balance commands of the ABC system [32]. From the circuit size profile shown in Table 1, one can see that ABC's internal canonical AND-INV representation leads to the removal of a large number of functionally equivalent gates.
Since Xilinx's mapper could not accept a circuit as large as LEKU-CD(G25), we broke LEKU-CD(G25) up into a collection of non-overlapping circuits, one circuit for each primary output. The resulting collection of circuits is clearly equivalent to LEKU-CD(G25), and is denoted LEKU-CD(G25)'.
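Splitting a network into one circuit per primary output amounts to extracting the transitive fan-in cone of each output and duplicating any logic the cones share. The following is a minimal sketch under that assumption; the function name and the netlist representation (a dictionary from gate name to fanin names) are hypothetical and do not reflect the scripts used to build LEKU-CD(G25)'.

```python
def split_by_output(fanins, outputs):
    """Return one stand-alone sub-network per primary output.

    `fanins` maps each gate to a tuple of fanin names; names that are not
    keys of `fanins` are primary inputs. Gates shared by several outputs are
    copied into every cone that needs them, so the cones do not overlap.
    """
    cones = {}
    for out in outputs:
        seen, stack = set(), [out]
        while stack:  # depth-first walk of the transitive fan-in
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(fanins.get(node, ()))
        cones[out] = {g: fanins[g] for g in seen if g in fanins}
    return cones


# Hypothetical example: gate "g" feeds both outputs and is duplicated.
shared = {"g": ("a", "b"), "o1": ("g", "c"), "o2": ("g", "d")}
cones = split_by_output(shared, ["o1", "o2"])
```

Each cone computes the same function as the corresponding output of the original network, so the collection is functionally equivalent to it, at the price of a larger total gate count whenever logic is shared.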
RESULTS
The results of our study will be presented in two parts: we first discuss the LEKO circuits created by createLEKO (G25, G125 and G625), and we then discuss the LEKU examples, which are functionally equivalent circuits of LEKO(G25): LEKU-CD(G25), LEKU-CD(G25)' and LEKU-CB(G25). The details of these circuits are shown in Table 1. Using these examples, we present the results of running the state-of-the-art academic FPGA mappers DAOmap [7] and ABC [32] and the leading-edge FPGA synthesis systems from Altera [33] and Xilinx [34] on each of the circuits. DAOmap from the SIS [28] and RASP [15] environment was used with options allowing only the use of LUTs with four or fewer inputs. Berkeley's ABC mapper was used from the ABC [32] environment, also for mapping into 4-LUTs. Note that DAOmap produces a depth-optimal mapping solution like FlowMap [12] but uses 29% fewer LUTs on average, as calculated from [7]. The ABC mapper also produces depth-optimal mapping solutions, but uses 7% fewer LUTs than DAOmap on average, as reported in [1]. Altera's logic synthesis tool was run from Quartus 5.0 [33] using Stratix device EP1S80F1508I7 and the option for area optimization. Xilinx's logic synthesis tool was run from Xilinx ISE 7.1i [34] using Virtex device xcv3200e, also with the option for area optimization. For the purposes of this study, we only performed the logic synthesis steps of these tools and did not go through final placement and routing. The depths of the mapped LEKO circuits are not reported here for two reasons: Xilinx and Altera optimize for delay instead of depth, and the final logic element in the Xilinx device is not a 4-LUT but a slice that combines two 4-LUTs.
Synthesis Results on LEKO Examples
This section illustrates how well mappers perform in achieving the optimal mapping solution if they do not have to carry out logic optimization. We tested this by running each of the mappers on each of the LEKO circuits. As the results in Table 2 show, each mapper and logic synthesis tool does a fairly good job mapping the benchmarks. The average gap from optimal varies from 5% (by Quartus) to 23% (by DAOmap), with an average of 15%. This shows that the current LUT-based FPGA mappers or synthesis tools perform quite well on circuits where logic optimization is not needed to get the optimal solutions. Note that Quartus and ISE perform both logic optimization and technology mapping, while DAOmap performs technology mapping only and ABC performs some logic optimization during mapping.
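The gap from optimal quoted here is assumed to be the relative excess area, (mapped - optimal) / optimal, and the upper-bound comparisons for LEKU examples are assumed to be plain ratios. The sketch below spells out both metrics. The mapped LUT count of 77 is a hypothetical value used only for illustration and is not a measured result, while 70 is the optimal area of G25 implied by Theorems 1 and 2.

```python
def gap_from_optimal(mapped_luts, optimal_luts):
    """Relative excess area of a mapping over the known optimum."""
    return (mapped_luts - optimal_luts) / optimal_luts


def area_ratio(mapped_luts, upper_bound_luts):
    """How many times larger a result is than a known upper bound."""
    return mapped_luts / upper_bound_luts


# Hypothetical mapper result of 77 LUTs on G25 (optimal: 70 LUTs).
example_gap = gap_from_optimal(77, 70)
print("gap: %.0f%%" % (100 * example_gap))
```

A gap of 0% means the mapper reached the optimum. For LEKU examples the ratio is the more natural metric, because the reference value is only an upper bound on the true optimum.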
Synthesis Results on LEKU Examples
This section illustrates how poorly most of the best available FPGA logic synthesis flows perform when logic restructuring and/or optimization is needed to achieve the optimal mapping solution. The academic mappers presented in this section are allowed to use standard preprocessing tools (script.algebraic for DAOmap and resyn2 for the ABC mapper) for technology-independent logic optimization, since the LEKU examples require logic restructuring/optimization to achieve the optimal mapping solutions. From the results in Table 3, we see that all four synthesis flows perform poorly and produce synthesis results with area ranging from 71X to 504X larger than the known upper bounds (the mapping results of the equivalent LEKO examples), averaging 172X larger. We believe Quartus produced a better solution on LEKU-CD' than LEKU-CD because it could perform more optimizations on each one of the circuits of LEKU-CD' due to their smaller size. One of the main reasons that every one of these algorithms performed so poorly is that they were not able to reconstruct the original structure of the circuit. The fact that the same logic synthesis flows perform so much worse on the LEKU examples than on the equivalent LEKO examples suggests that the existing logic optimization algorithms are not capable of reproducing the initial circuit structure of the LEKO examples. This suggests that there may be significant opportunity for improvement of the existing logic synthesis algorithms. For example, we believe that in order for logic synthesis algorithms to perform well on the LEKU examples, they must have a more global view of resynthesis, including duplication removal, logic identification, and many other heuristics that examine the circuit globally. Without such global heuristics, algorithms do not perform well on LEKU examples and may produce poor results on large real-world circuits as well.
CONCLUSIONS
In this paper we presented an algorithm for creating synthetic benchmarks with known optimal technology mapping solutions for LUT-based FPGA designs. Using these LEKO (Logic synthesis Examples with Known Optimal) and LEKU (Logic synthesis Examples with Known Upper-bounds) benchmarks of sizes ranging from a few hundred nodes to over one million nodes, we experimented on four state-of-the-art FPGA logic synthesis flows. We showed that although leading FPGA technology mapping algorithms can produce close to optimal solutions with an average gap of 15% on the LEKO examples, the results from the entire logic synthesis flows (logic optimization + mapping) are far from optimal. The best industrial and academic FPGA synthesis flows are around 70 times larger in terms of area on average, and in some cases as much as 500 times larger on LEKU examples.
We hope that these surprising results and examples will stimulate the logic synthesis community as the PEKO examples did the physical design community. Needless to say, the potential of large-scale area reduction is of great interest to the IC and EDA industries. If realized, it would lead to significant improvement in the density and cost of future integrated circuits. It is possible that the artificial examples constructed by our algorithm may not appear in every real-life circuit. However, these examples will help to identify deficiencies in the current logic synthesis algorithms and improve their quality.
Although our optimality study is done for LUT-based FPGAs, we think that the same technique can be easily extended to cell-based IC designs, where one needs to map to a library of different logic cells. In this case, we need to modify the construction of G5 so that it remains a "hard core" basis for constructing larger hard examples.
The LEKO and LEKU examples are available online at [35] .
Acknowledgements

APPENDIX
Summary of the Proof of Theorem 2
Now that we have the ability to construct arbitrarily sized LEKO circuits, we can show that this construction actually creates a circuit G with a known optimal binate cover, which we prove in Theorem 2. Assuming we have an arbitrary LEKO circuit G with L layers, we prove Theorem 2 by induction over the layers of G. Claim 1 will be used in almost all of the other claims as it proves that there are no reconverging paths of G5s. Claim 2 and Claim 3 will help prove the base case, while Claim 4, working with Claim 2 and Claim 3, helps prove the inductive step.
Claim 1: Tree-like structure of G5s (No reconverging paths of G5s)
Given an arbitrary G5, x, on the top layer of G, starting at any G5 at the bottom layer there is only one way to traverse the G5s to get to x.
Proof
Assume we start at an arbitrary G5, call it x, on the top layer. From the construction, a path exists from any G5 at the bottom layer to x (i.e., x is connected to every G5 on the bottom layer). Now let us consider the maximum number of G5s we are connected to after one layer, which is 5 (since x has 5 inputs). Similarly, after two layers the maximum number of G5s that are connected to x is 5^2 (since x has 5 inputs and the G5s that feed its inputs also have 5 inputs), and the maximum number of G5s that can reach x at layer L (after L-1 layers) is 5^(L-1). Now, if there were any reconverging paths connecting x to the rest of the G5s, there would be strictly fewer than 5^(L-1) G5s at the bottom layer that can reach x. By construction, every one of the 5^(L-1) G5s at the bottom layer can reach x; therefore there are no reconverging paths.
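The counting argument of Claim 1 can be written out as a short derivation. The symbols P_k (the number of paths of G5s that start k layers below x and end at x) and N_k (the number of distinct G5s k layers below x that reach x) are introduced here for illustration only.

```latex
\begin{align*}
  P_0 &= 1, \qquad P_k = 5\,P_{k-1}
      \;\Longrightarrow\; P_{L-1} = 5^{L-1}
      && \text{(each G5 has five inputs, each driven by one output)} \\
  N_{L-1} &\le P_{L-1} = 5^{L-1}
      && \text{(every reachable G5 starts at least one path)} \\
  N_{L-1} &= 5^{L-1}
      && \text{(by construction, every bottom-layer G5 reaches } x\text{)}
\end{align*}
```

Since the 5^(L-1) paths are shared among 5^(L-1) bottom-layer G5s and each of them starts at least one path, each starts exactly one, which is the statement that no paths reconverge.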
Claim 2: Mapping upward (layer i)
Mapping the nodes in G5 at layer i so that the resulting LUT takes nodes from layer i and i+1 (i.e., mapping upward across a layer) requires one more LUT than mapping within the layers.
Proof
Assuming that the inputs to G5 on layer i are already LUTs, we know that the optimal mapping of each G5 has everything packed as tightly as possible (it only uses 4-LUTs), so in order to extend into layer i+1, one of the LUTs (in the optimal mapping of a G5) has to split into two separate LUTs, thereby creating one additional LUT. This is because every output of a particular G5 in layer i will never combine with another output from that G5 in any layer above i (Claim 1).
Claim 3: Mapping upward (layer i+1)
Any extensions of LUTs from layer i into layer i+1 will not result in layer i+1 being mapped with fewer LUTs than mapping within the layers.
Proof
Assume that the number of LUTs to map a G5 optimally is N, and the LUT that spans layer i and layer i+1 is called x. Since LUT x is partially in layer i and partially in layer i+1, the LUT has at most three inputs in layer i+1. Since this LUT in layer i+1 has only three inputs to choose from, and we know the optimal mapping for G5 in layer i+1 is strictly made up of LUTs with exactly four inputs, this will result, in the best case, in a mapping for the G5 in layer i+1 with N LUTs plus one spanning the two layers. Another way to look at this is to consider the question: Can you map the G5 on layer i+1 using N-1 4-LUTs and one 3-LUT? Consider Figure 6 for a pictorial representation of the question posed. The answer to this question is clearly no because of the optimality of the N LUTs needed to map G5.
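Claim 4: Mapping downward (layer i-1)
The optimal mapping of a G5 at layer i does not create LUTs that extend into layer i-1.
Proof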
Assume that the inputs to G5 at layer i are already LUTs. We know that in the optimal mapping of each G5 everything is already packed as tightly as possible (it only uses 4-LUTs), so extending into layer i-1 will not be possible unless there are reconverging paths at some layer below i. However, this is impossible. Due to the tree structure of G, every input into every G5 at layer i will never meet again; this was proven in Claim 1.
Proof of Theorem 2
Let G be an arbitrary LEKO circuit with L layers constructed using the above construction. Let us define a property that we will use in the proof. Property P(n): Let P(n) mean that the optimal mapping for all nodes up to layer n is the optimal mapping of each G5 separately.
It is then enough to show that P(1) is true and that P(m-1) ⇒ P(m), where 2 ≤ m ≤ L (which will show by induction that the optimal mapping of our arbitrary G is just the optimal mapping of each G5 separately).
Base case: P(1) is true.
Proof
Before we begin the proof, recall that "all nodes up to layer 1" comprises the bottom layer of G5s (a total of 5^(L-1) G5s). With this picture in mind, let us consider all the possible ways to map all nodes up to layer 1.
Case a: Mapping exactly all the nodes of layer 1 and not mapping any nodes of layer 2.
Since there is no overlap between the G5s (thus trying to pack nodes from different G5s into one LUT cannot possibly reduce area) and we know the optimal mapping of G5, the optimal mapping of this layer will result in mapping each G5 separately.
Case b: Mapping exactly all the nodes of layer 1 and possibly mapping some nodes of layer 2. Now we have to consider the case where the optimally mapped 4-LUT solution for layer 1 maps some nodes in layer 2. But from Claim 2 (its assumption holds since we are at the lowest layer and all the inputs are primary inputs) and Claim 3, we see that it will not help mapping if LUTs span across layer 1 and into layer 2; thus case b cannot happen and case a must happen. From case a we can see that P(1) must hold, thus the base case is proved.
Inductive step
P(m-1) ⇒ P(m).
Proof
Recall that P(m) is saying that the optimal mapping for all nodes up to layer m is the optimal mapping of each G5 separately. Since P(m-1) is assumed, we know that the optimal mapping for all nodes up to layer m-1 is the optimal mapping of each G5 separately. Now all we need to know to prove P(m) is that any LUTs spanning two separate layers will not result in a better mapping solution (Claim 4). With Claim 2, which uses the inductive hypothesis P(m-1) to uphold the assumption that all the inputs to layer m are already LUTs, and Claim 3, we know that the mapping of nodes up to layer m will not intrude on layer m+1. And with Claim 4, we know that the mapping will not create LUTs that intrude into any layer below m. Thus, the optimal mapping for layer m is to map each G5 separately, and the inductive step is proven. Therefore, by induction, the optimal mapping of G is that which maps every G5 optimally and separately.
