Abstract-We leverage properties of the logic synthesis netlist to define both a new field-programmable gate-array (FPGA) logic element (function generator) architecture and an associated technology mapping algorithm that together provide improved logic density. We demonstrate that an "extended" logic element with slightly modified K-input lookup tables (LUTs) achieves much of the benefit of an architecture with (K+1)-input LUTs, while consuming silicon area close to a K-LUT (a K-LUT requires half the area of a (K+1)-LUT). We introduce the notion of "non-inverting paths" in a circuit's AND-inverter graph (AIG) and show their utility in mapping into the proposed logic element architectures. We propose a general family of logic element architectures, and present results showing that they offer a variety of area/performance tradeoffs. One of our key results demonstrates that while circuits mapped to a traditional 5-LUT architecture need 15% more LUTs and have 14% more depth than a 6-LUT architecture, our extended 5-LUT architecture requires only 7% more LUTs and 5% more depth than 6-LUTs, on average. Nearly all of the depth reduction associated with moving from K-input to (K+1)-input LUTs can be achieved with considerably less area using extended K-LUTs. We further show that 6-LUT optimal mapping depths can be achieved with a small fraction of the LUTs in hardware being 6-LUTs and the remainder being extended 5-LUTs, suggesting that a heterogeneous logic block architecture may prove to be advantageous.
LUTs. We speculate that a reason for this may be a lack of research focus on synthesis techniques for easy targeting and evaluation of non-LUT-based logic block architectures. In this paper, we consider FPGA logic block architecture and propose logic elements with superior area-efficiency, as well as a simple mapping strategy for the proposed logic elements.
The tight interaction between architecture and computer-aided design (CAD) for FPGAs is well-established. The approach taken in most FPGA architecture research is hardware-driven in the sense that the hardware "idea" comes first into the architect's mind, and CAD tools are subsequently developed to target and gauge the benefit of the proposed hardware. A classic work by Rose et al. studied what LUT size provides the best area-efficiency in FPGAs [1] . Here, we revisit the question of area-efficiency, however in our case, architecture research is turned upside down and the reverse approach is taken: the CAD tools themselves suggest a particularly natural architecture.
Raising logic density is the goal of this research and is one which we believe to be well-motivated. There has recently been a trend toward larger LUTs in commercial FPGAs. The LUTs in the Xilinx Virtex-5 FPGA and the Altera Stratix-III FPGA can realize 6-input logic functions [2], [3]. However, since customer designs contain only a limited number of large functions, LUTs in commercial chips are multi-output and fracturable into several smaller LUTs. For example, the 6-LUT in Virtex-5 can implement any 6-input function, or any two functions that together use up to five distinct inputs. LUTs in Stratix-III offer even more fracture-flexibility and can implement two independent 4-LUTs. From the vendor perspective, while the delay benefits of 6-LUTs are desirable, care has been taken to mitigate under-utilization and achieve high logic density [4]. In this paper, we re-examine the LUT structure and challenge the conventional wisdom that full K-input LUTs are necessary to implement K-variable logic functions in FPGA logic blocks. We show that a smaller K-input logic element that uses fewer transistors can be used in place of a K-LUT, with relatively little impact on circuit speed. Logic density is improved through the use of the proposed logic elements. In this paper, logic density refers to the amount of silicon area (based on transistor count) required to implement a given circuit.
We propose a new technology mapping approach and FPGA logic element architecture, both of which are motivated by properties of the logic synthesis netlist. Our mapping approach and architecture are relatively small variations on published techniques. Specifically, we use properties of the logic synthesis netlist to identify "gating" inputs to LUTs, where a gating input is one that has a particular logic state (logic-0 or logic-1) that forces the LUT output to evaluate to a particular logic state (logic-0 or logic-1). We give a simple scheme for finding such gating inputs and show that they occur frequently in circuits. We leverage the gating input concept in the design of new logic element architectures that offer considerably better logic density compared with those in today's commercial FPGAs. Our combined "architecture + CAD approach" to attack logic density is an entirely new direction and is in contrast to recent CAD work which used don't cares to reduce LUT count in mapped circuit implementations [5] .
A preliminary version of a portion of this paper appeared in [6] . In this extended journal version, we generalize the mapping approach to discover additional gating input signals to LUTs. In particular, we offer an approach to identify inputs that cause the LUT function to evaluate to logic-1, in addition to the logic-0 case described in the conference version. The extended LUT architecture is altered accordingly to handle the scenario where a gating input forces a function to logic-1. We propose a generalized family of extended LUT architectures, each offering a different area/performance tradeoff. Finally, we present more comprehensive experimental results, including results for an additional set of benchmark circuits.
The remainder of this paper is organized as follows. Background and related work appear in Section II. Our mapping approach and the proposed logic element architecture are described in Section III. An experimental study is presented in Section IV. Conclusions and suggestions for future work are offered in Section V.
II. BACKGROUND

A. LUT Hardware Architecture
A K-input LUT (K-LUT) is a hardware implementation of a truth table that can implement any logic function that requires up to K variables. Central to our work is the property that the required silicon area to implement a LUT increases exponentially with the number of LUT inputs. Fig. 1 shows the hardware for 2- and 3-input LUTs. Three-input LUTs have eight SRAM cells to hold the truth table of the logic function implemented by the LUT, and require a tree of seven 2-to-1 multiplexers. 2-LUTs have four SRAM cells and require three 2-to-1 multiplexers. In general, K-LUTs have 2^K SRAM cells and 2^K - 1 multiplexers. Six-LUTs in today's commercial FPGAs have 64 SRAM cells. Adding inputs to a LUT is clearly a costly endeavor, and the thrust of our work is an approach for "getting away with" using smaller LUTs, while at the same time realizing the benefits of larger LUTs.
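The exponential growth described above can be tabulated with a short sketch. The function name `lut_resources` is hypothetical, and the count assumes the plain multiplexer-tree structure of Fig. 1 with no buffers or fracturing logic.

```python
def lut_resources(k: int) -> dict:
    """SRAM cells and 2-to-1 multiplexers in a K-input LUT built as a mux tree.

    Assumes the plain structure of Fig. 1: one SRAM cell per truth-table row
    and a binary tree of 2-to-1 multiplexers selecting among them.
    """
    if k < 1:
        raise ValueError("a LUT needs at least one input")
    return {"sram_cells": 2 ** k, "mux2": 2 ** k - 1}


if __name__ == "__main__":
    for k in range(2, 8):
        print(k, lut_resources(k))
```

Each additional input doubles the SRAM cell count, which is why a K-LUT needs roughly half the resources of a (K+1)-LUT.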
B. FPGA Technology Mapping
Research on technology mapping for FPGAs was active in the early 1990s with a wide range of algorithms proposed (e.g., [7], [8]). In comparison with technology mapping for application-specific integrated-circuit (ASIC) standard cells, the FPGA mapping problem is simplified as a consequence of the target gate being a K-LUT that can implement any K-variable function. FPGA mappers need not focus on finding logic functions in circuits that match with gate functions in a target library; rather, FPGA mappers must cover a circuit with K-variable functions (any K-variable functions). Recent FPGA technology mappers are based on the notion of cuts [9], [10], which we review here. The combinational portion of a circuit can be represented by a directed acyclic graph (DAG), where each node represents a single-output logic function and edges between nodes represent input/output dependencies among the corresponding logic functions. For a node in the circuit DAG, let represent the set of nodes that are fanins of . Fig. 2(a) illustrates the notion of a cut for a node . A cut for is a partition, , of the nodes in the subgraph rooted at , such that . For 's cut in Fig. 2(a) , consists of two nodes, and . A cut is called K-feasible if the number of nodes in that drive nodes in is at most K. In the case of Fig. 2(a), there are three nodes that drive nodes in , and the cut is three-feasible. For a cut represents the nodes in that drive a node in . For the cut in Fig. 2(a) , . represents the set of nodes . For a K-feasible cut, , the logic function of the subgraph of nodes can be implemented by a single K-LUT (since the cut is K-feasible and a K-LUT can implement any function of K variables). The key point to realize is that the problem of finding all of the possible K-LUTs that generate a node's logic function is equivalent to the problem of finding all K-feasible cuts for the node.
Generally, there can be multiple K-feasible cuts for a node in the network, corresponding to multiple LUT implementations.
represents the set of all feasible cuts for a node .
Traversing the circuit DAG in topological order, the cuts for a node are generated by merging cuts from its fan-in nodes, using the method described in [9], [10], and outlined here. Consider generating the K-feasible cuts for a node with two fan-in nodes, and . The lists of K-feasible cuts for and have already been computed, due to the graph traversal order. Assume that node has one K-feasible cut, , and node has one K-feasible cut, , as shown in Fig. 2(b). We can merge and to create a cut, , for node , such that and [see Fig. 2(b)]. If the merged cut has more than K inputs, the resulting cut is not K-feasible, and is discarded. In this example, the two fan-in nodes have only one cut each; however, if instead they had multiple cuts, all possible cut merges would be attempted to form the complete cut set for the node. This provides a general picture of how the cut generation procedure works; however, there are several special cases to consider and the reader is referred to [9] for details.
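The merging procedure outlined above can be sketched as follows. This is a minimal illustration, not the procedure of [9], [10]: each cut is represented only by its set of input nodes, the trivial cut (the node itself) is always included, and dominated cuts are not pruned. The names `enumerate_cuts`, `fanins`, and `order` are hypothetical.

```python
from itertools import product


def enumerate_cuts(fanins, order, k):
    """Enumerate K-feasible cuts by merging fan-in cuts in topological order.

    fanins: dict mapping a node to a tuple of its fan-in nodes
            (primary inputs are absent or map to an empty tuple).
    order:  all nodes listed in topological order (fan-ins first).
    k:      maximum number of cut inputs.
    Returns a dict mapping each node to a set of frozensets of cut inputs.
    """
    cuts = {}
    for node in order:
        node_cuts = {frozenset([node])}  # trivial cut: the node itself
        ins = fanins.get(node, ())
        if ins:
            # Try every combination of one cut per fan-in node.
            for combo in product(*(cuts[f] for f in ins)):
                merged = frozenset().union(*combo)
                if len(merged) <= k:  # discard cuts that are not K-feasible
                    node_cuts.add(merged)
        cuts[node] = node_cuts
    return cuts


if __name__ == "__main__":
    # z = AND(x, y), x = AND(a, b), y = AND(c, d)
    fanins = {"x": ("a", "b"), "y": ("c", "d"), "z": ("x", "y")}
    order = ["a", "b", "c", "d", "x", "y", "z"]
    for cut in sorted(enumerate_cuts(fanins, order, 3)["z"], key=sorted):
        print(sorted(cut))
```

With K = 3, the merge of the two fan-in cuts {a, b} and {c, d} has four inputs and is discarded; with K = 4 it is kept.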
Having computed the set of K-feasible cuts for each node in the DAG, the graph is traversed in topological order again. During this second traversal, a "best cut" is chosen for each node. The best cut may be chosen based on any criteria, whether it be area, power, delay, routability, or a combination of these. The best cuts define the LUTs in the final mapped solution.
C. ABC Synthesis Framework
Work on logic synthesis has been reinvigorated by the introduction of the ABC system developed at UC Berkeley [11] . In ABC, the circuit DAGs are AND-inverter graphs (AIGs), that is, logic functionality is represented as a network of two-input AND gates connected by invertible edges. An example of an AND-inverter graph is shown in Fig. 3 . The use of AND inverter graphs (AIGs) eases the implementation of many useful logic synthesis transformations (e.g., [12] , [13] ).
Among other advantages, AIGs have proven to be effective for cut-based LUT mapping. In [14], Mishchenko et al. introduced the notion of priority cuts, where instead of storing all possible cuts for each node, only a subset of "priority cuts" is stored, based on a cost function. When generating the cut set for a node, only the priority cuts of its fan-in nodes are considered for merging. Although many cuts are pruned with this technique, little quality degradation is observed in practice, and results are comparable to those of competing mappers. Mapping quality is not compromised by using AIGs compared with other network representations [14].
We conduct our research using AIGs within the ABC framework and we propose new logic element architectures and a mapping approach. Our logic elements contain structures beyond LUTs and experimental results demonstrate the area and performance benefits of the proposed logic elements. It is worth mentioning that a few recent works also studied technology mapping into non-LUT structures. Ling et al. used satisfiability (SAT)-based techniques for mapping into blocks with LUTs and gates [15] . Recent work from Actel used cut-based techniques to map into a logic block architecture with gates, and then applied Boolean matching to filter cuts that could not be legally mapped to the target block [16] . Both of these prior works considered the mapping problem in isolation, and not from the architecture evaluation perspective.
III. LOGIC ELEMENT ARCHITECTURE AND MAPPING
Our architecture and technology mapper take advantage of the AIG representation of logic functions. In particular, the proposed logic element architecture relies on the property that only AND gates and inverters can appear in the graph. We introduce the proposed architecture using the example AIG shown in Fig. 4(a). Two four-input cuts are shown: and , corresponding to LUTs implementing the functions and , respectively. Both and are four-feasible cuts. However, a key observation can be made regarding and in Fig. 4(a). Looking first at , observe that one of the inputs to the cut is the output signal of gate , and that signal is also a direct input of gate (the root node). Since gate is an AND gate, we know that when the output of is logic-0, then the output of must necessarily be logic-0. Conversely, when the output of is logic-1, the output of is the output of gate (complemented), which in this case is only a three-input logic function. Hence, for the case of , even though the cut is four-feasible and represents a logic function of four variables, we do not need the full flexibility of a 4-LUT in hardware to realize the function. In fact, we can realize the function using the logic element shown in Fig. 5(a), comprising a 3-LUT and a single AND gate: an extended LUT. Since the AIG subject graph contains only two-input AND gates with optional edge complementation, we need not be concerned with gates other than AND appearing in the input circuit graph. In essence, we are using a property of the synthesis graph to inspire our logic element architecture.
Turning now to in Fig. 4(a), one can see that the same observation also applies: if either or is logic-0, then the output is also logic-0. In this case, however, none of the cut inputs are also inputs to the root AND gate. Yet again, we do not need the full power of a 4-LUT to express the function of . Regarding , observe that the "gating" property does not hold for all of the inputs to the cut. For example, if input is logic-0, we cannot determine whether will be logic-0 or logic-1, as it will depend on the values of the other cut inputs. It is worth mentioning the relationship between gating inputs to a function and unate inputs, which are well-described in the literature [17]. Consider a function f with an input variable x. Variable x is said to be a unate input to f if and only if f's sum-of-products (SOP) representation contains either x or its complement x', but not both. If x is a unate input and f's SOP representation contains x (in true form), then a transition on x can only cause a transition on f in the same direction. On the other hand, if f's SOP representation contains only x', then a transition on x can only cause a transition on f in the opposite direction. Certainly, gating inputs to a function are unate inputs; however, the converse is not necessarily true: a unate input may not necessarily be a gating input.
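The distinction between gating and unate inputs can be checked exhaustively on small functions. The sketch below is illustrative only: it works on truth tables rather than on the AIG, it uses the monotonicity characterization of unateness rather than the SOP definition given above, and the names `gating_inputs` and `is_unate` are hypothetical.

```python
from itertools import product


def gating_inputs(f, n):
    """Return (input index, input value, forced output) triples for f.

    f takes a tuple of n bits and returns 0 or 1. An input is gating if
    fixing it to some value makes f constant over all other inputs.
    """
    found = []
    for i in range(n):
        for v in (0, 1):
            outs = {f(bits) for bits in product((0, 1), repeat=n) if bits[i] == v}
            if len(outs) == 1:
                found.append((i, v, outs.pop()))
    return found


def is_unate(f, n, i):
    """True if f is monotone (non-decreasing or non-increasing) in input i."""
    non_decreasing = non_increasing = True
    for bits in product((0, 1), repeat=n):
        if bits[i] == 0:
            low = f(bits)
            high = f(bits[:i] + (1,) + bits[i + 1:])
            if high < low:
                non_decreasing = False
            if high > low:
                non_increasing = False
    return non_decreasing or non_increasing


if __name__ == "__main__":
    h = lambda b: (b[0] & b[1]) | b[2]   # h = a*b + c
    print(gating_inputs(h, 3))           # only c (index 2) is gating
    print(is_unate(h, 3, 0))             # a is unate but not gating
```

In h = a·b + c, input a is unate, yet neither value of a fixes the output, so a is not a gating input; input c is gating because c = 1 forces h to logic-1.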
A. Mapping Approach: Non-Inverting Paths in the AIG
The core of our approach is to restrict cuts with K inputs to those that resemble the cuts in Fig. 4(a). The defining feature of such cuts is the presence of a non-inverting path from at least one of the cut inputs to the root of the cut. Some examples are shown in Fig. 4(b). In this case, when any of inputs , or is logic-0, root node 's output must be logic-0. Observe that the edge crossing the cut may be a complemented edge, as is the case for in the figure. However, edges along the path from the cut "frontier" nodes to the root must be non-inverting. It is a straightforward process during cut generation to traverse the graph downward from the root and determine whether K-input cuts have at least one non-inverting path to a cut input. Fig. 4(c) gives an example cut with no non-inverting path from any of its inputs.
Restricting cuts with K inputs to be those that contain non-inverting paths will produce mappings that can be accommodated in an architecture with extended (K-1)-LUTs, which require about half the silicon area of K-LUTs. The extension to which we refer is the presence of an AND gate on the LUT output, as shown in Fig. 5(b). The other input to the AND gate can be programmably connected to either the true or complemented form of an input signal. The optional inversion is needed to handle the case of complemented edges crossing the cut, such as in Fig. 4(b). The restriction that cuts have non-inverting paths is only imposed for cuts with K inputs; cuts that use fewer than K inputs remain unrestricted. When the logic element in Fig. 5(b) is used to implement functions that require fewer than K inputs, we assume that the input is tied to either VCC or ground and that the multiplexer select SRAM cell is set such that the AND gate is bypassed. We believe this to be a reasonable assumption, as unused logic block inputs are common in FPGA designs and commercial FPGAs contain circuitry to tie unused inputs to a known logic state.
The obvious question that arises is: What is the impact of restricting the cuts of size K, from the perspectives of logic element count and speed (depth)? Surprisingly, as we will demonstrate in our experimental study, our mapping approach and logic element achieve much of the benefit of K-LUTs, while consuming much less area.
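To define our approach formally, let $\mathrm{Cuts}(z)$ denote the set of K-feasible cuts for a root node $z$, and let $\mathrm{Inputs}(C)$ denote the set of inputs of a cut $C$. Let $\mathrm{Cuts}_{<K}(z)$ be the set of cuts for $z$ that use fewer than K inputs, as computed using the standard merging procedure described in Section II:

$$\mathrm{Cuts}_{<K}(z) = \{\, C \in \mathrm{Cuts}(z) \ \ \text{s.t.} \ \ |\mathrm{Inputs}(C)| < K \,\} \qquad (1)$$

and let $\mathrm{Cuts}_{K}^{NI}(z)$ be the set of K-input cuts of $z$ that contain a non-inverting path to one of the cut inputs:

$$\mathrm{Cuts}_{K}^{NI}(z) = \{\, C \in \mathrm{Cuts}(z) \ \ \text{s.t.} \ \ |\mathrm{Inputs}(C)| = K \ \text{and} \ \exists\, i \in \mathrm{Inputs}(C) : P(i, z) \ \text{is non-inverting} \,\} \qquad (2)$$

where $P(i, z)$ is a path in the AIG from a cut input $i$ to the cut root $z$. If $i$ directly drives $z$, then the path is a single edge and is a non-inverting path. Otherwise, there must be intermediate nodes $n_1, \ldots, n_m$ on the path from $i$ to $z$ and, without loss of generality, we can represent $P(i, z)$ as a sequence of AIG edges:

$$P(i, z) = \big( (i, n_1),\ (n_1, n_2),\ \ldots,\ (n_m, z) \big) \qquad (3)$$

As described in Section III-A, for path $P(i, z)$ to be called non-inverting in (2), all of the edges on the partial path from $n_1$ to $z$ must be uncomplemented, i.e., the edges must be true edges. The edge crossing the cut, $(i, n_1)$, may be true or complemented.

Finally, the set of filtered cuts that will be considered for a node $z$ in our technology mapper is

$$\mathrm{Cuts}_{<K}(z) \ \cup\ \mathrm{Cuts}_{K}^{NI}(z) \qquad (4)$$

Fig. 6. Example AIG cut from benchmark circuit alu4 with a controlling input i4 that causes the function to evaluate to logic-1.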
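The non-inverting path check can be sketched as a downward traversal from the cut root. This is an illustrative sketch rather than the mapper's actual code: the AIG encoding (a dict from each AND node to two `(fanin, inverted)` pairs) and the name `logic0_gating_inputs` are assumptions, and the cut is assumed to be valid, so that every downward path from the root reaches a cut input.

```python
def logic0_gating_inputs(aig, root, cut_inputs):
    """Find cut inputs with a non-inverting path to the cut root.

    aig:        dict mapping an AND node to ((fanin0, inv0), (fanin1, inv1)),
                where invN is True for a complemented edge.
    root:       root node of the cut.
    cut_inputs: set of nodes forming the cut inputs.
    Returns a set of (input, value) pairs, where setting the input to that
    value forces the root to logic-0.
    """
    found = set()
    stack = [root]
    seen = set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for src, inverted in aig[node]:
            if src in cut_inputs:
                # The edge crossing the cut may be true or complemented;
                # a complemented edge means logic-1 is the gating state.
                found.add((src, 1 if inverted else 0))
            elif not inverted:
                # Interior edges on the path must be true (uncomplemented).
                stack.append(src)
    return found


if __name__ == "__main__":
    # z = AND(g, NOT n1), n1 = AND(a, b); cut inputs are g, a, b.
    aig = {"n1": (("a", False), ("b", False)),
           "z": (("g", False), ("n1", True))}
    print(logic0_gating_inputs(aig, "z", {"g", "a", "b"}))
```

A cut with K inputs would be kept by the filter in (4) only when the returned set is non-empty.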
B. Identifying Additional Gating Inputs
While the discussion above centers on identifying a LUT input that causes the LUT's function to evaluate to logic-0, there may also exist easily identifiable LUT inputs that cause a function to evaluate to logic-1. An example case is illustrated in Fig. 6, which shows a cut from one of the benchmark circuits used in our experimental study (alu4). The logic function implemented by the cut is:
. An inspection of the AIG reveals that no input to the cut has a non-inverting path to the root-no single input can cause the function to evaluate to logic-0. Applying De Morgan's law to the two clauses in the cut's Boolean function, we attain the function in conjunctive normal form:
.
In this form, we see by inspection that input i4 is a gating input to the cut: when i4 is logic-0, the function evaluates to logic-1. Observe in the AIG that there are reconvergent paths from input i4 to the cut root. Though the cut in Fig. 6 does not contain a non-inverting path, it does indeed have a gating input. We again do not need the full power of a 6-LUT to implement the function.
The block architecture shown in Fig. 7 is capable of handling both cases, where a gating input causes the function to evaluate to either logic-0 or logic-1. It has approximately the same silicon area as the block in Fig. 5(b), with the key change being that the two-input AND gate in Fig. 5(b) is replaced with a 2-to-1 multiplexer in Fig. 7. One of this multiplexer's data inputs is received from the LUT; its second data input is received from an SRAM configuration cell. The SRAM configuration cell is configured according to whether the gating input causes the function to evaluate to logic-0 or logic-1. As before, a second multiplexer permits the input's gating state to be either logic-0 or logic-1. The architecture in Fig. 7 is also referred to as an extended LUT; however, in this case, the LUT is extended with a MUX instead of an AND. A straightforward extension of the mapping approach outlined above can be used to identify cut inputs that cause a function to evaluate to logic-1. Consider the root gate of the cut under consideration, the root's two fan-in nodes, and the cut input we wish to analyze. The input is a gating input that causes the cut function to evaluate to logic-1 if the following conditions are met.
• The fan-in edges of the root gate are inverted.
• There are non-inverting paths from the input to each of the root's two fan-in nodes.
The non-inverting paths from the input cause both fan-in nodes to evaluate to logic-0 when the input is in a particular logic state (either logic-0 or logic-1). In essence, we seek partial non-inverting paths from a cut input to the fan-in nodes of the cut's root node, with the added requirement that the root's fan-in edges be complemented. Note that the logic element in Fig. 7 can also accommodate cases where a gating cut input causes the cut root to evaluate to logic-0. Such cases can be discovered using the approach outlined in Section III-A above, namely, finding a single non-inverting path from an input to the root.
In our experimental study, we consider both AND-extended LUTs [see Fig. 5(b)] as well as MUX-extended LUTs (see Fig. 7), and we show that the added flexibility afforded by the MUX-extended LUT provides modestly better performance and area results.
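The behavior of the MUX-extended LUT can be captured in a small functional model. This is a behavioral sketch under stated assumptions, not a description of the circuit in Fig. 7: the function name and parameters are hypothetical, `gating_state` models the programmable input polarity, and `forced_value` models the SRAM configuration cell feeding the output multiplexer.

```python
from itertools import product


def mux_extended_lut(lut_bits, lut_inputs, gate_in, gating_state, forced_value):
    """Behavioral model of a LUT followed by a 2-to-1 output multiplexer.

    lut_bits:     truth table of the inner LUT, indexed with input 0 as LSB.
    lut_inputs:   tuple of bits driving the inner LUT.
    gate_in:      value of the gating input.
    gating_state: configured value of gate_in that overrides the LUT.
    forced_value: configured constant output when the override is active.
    """
    if gate_in == gating_state:
        return forced_value
    index = sum(bit << pos for pos, bit in enumerate(lut_inputs))
    return lut_bits[index]


if __name__ == "__main__":
    # f = g AND (a XOR b): g = 0 forces logic-0, inner 2-LUT holds XOR.
    for g, a, b in product((0, 1), repeat=3):
        out = mux_extended_lut([0, 1, 1, 0], (a, b), g, 0, 0)
        print(g, a, b, out)
```

Setting `forced_value` to 0 reproduces the AND-extended element of Fig. 5(b); setting it to 1 covers the logic-1 gating case discussed in this section.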
C. Generalized Architectural Families: Extended LUTs
Having considered two classes of gating inputs to LUTs, we now broaden the scope to consider cases wherein there are multiple gating input signals. The AND- and MUX-extended LUT logic element architectures described above [and shown in Figs. 5(b) and 7] can be generalized to use multiple cascaded gates; for example, Fig. 8 shows a 5-LUT with two cascaded AND gates. We characterize such logic element architectures in a general form as -AND-extended LUTs and -MUX-extended LUTs. For example, a -AND-extended LUT contains a four-input lookup table, followed by two cascaded AND gates. The generalized forms are depicted in Fig. 9. Our experimental study considers a wide range of logic element architectures that fall into these generalized logic element families.
We envision that LUTs extended with other types of gates, for example an exclusive-OR-extended LUT, may also prove useful; however, mapping circuits into such architectures is not straightforward and cannot be achieved through a simple traversal of the AIG representation.
D. Overall Architecture
Fig. 10 gives an abstract view of an FPGA and illustrates the proposed architectural change. The FPGA itself is a 2-D array of tiles with programmable logic and routing resources. The figure shows that in addition to LUTs, tiles contain other logic, for example fast carry chain arithmetic logic and storage elements (programmable flip-flops). We propose to replace the K-input LUTs with alternative logic elements that also use K inputs, namely the extended LUTs that use considerably fewer transistors (and less area) than K-LUTs (illustrated on the lower-right of Fig. 10). The arithmetic and other logic surrounding the LUTs can remain unchanged in the proposed architecture.
Regarding carry logic, in Xilinx FPGA families such as Virtex-5 and Virtex-6, coupled with each 6-LUT is a 2-to-1 carry chain multiplexer driven by the LUT output. The multiplexer realizes the carry generate/propagate functionality in carry look-ahead addition. Specifically, for the addition of two bits, the 6-LUT is used in dual-output mode, where one of the two outputs produces the propagate function and the second output produces the generate function. Two functions of two common inputs can be realized in a dual-output 3-LUT and consequently, as long as the proposed extended LUT contains at least a 3-LUT, carry arithmetic can be handled identically to today's commercial chips. Note also that in some commercial FPGAs, the SRAM cells in the LUTs can be used to implement small memories and/or shift registers. Such functionality is also possible using the proposed extended LUTs, albeit with fewer SRAM bits available.
Since the original K-LUTs and the proposed elements use the same number of inputs, it is expected that they will exert a similar demand on the FPGA's programmable routing fabric. The equivalent pin demand implies that the new logic elements can be interchanged with the original LUTs, while the programmable routing fabric remains constant. In other words, the new logic elements do not necessitate a change in the FPGA's routing fabric, a property that makes them fairly straightforward to incorporate into an existing commercial architecture.
E. CAD Implementation
Fig. 11 illustrates the typical FPGA design flow, comprising HDL and logic synthesis, technology mapping, placement, routing, and finally, bitstream generation. Only the technology mapping phase needs to be specialized for the proposed architectural change. No changes are necessary for the other phases of the flow, e.g., changes to placement and routing. Supporting the proposed architecture is relatively low-cost from the tools' standpoint.
Mapping circuits into logic elements with multiple cascaded gates can be achieved through repeated application of the techniques described above, with the added requirement that finding more than one non-inverting path in the AIG may be necessary. For example, to map a seven-input logic function into the architecture of Fig. 8, we must identify two gating inputs. Mapping a function into a logic element with a cascade of MUX gates is also straightforward. For example, suppose we wish to evaluate whether a six-variable function can map into a logic element with a 4-LUT and two cascaded MUX gates. We first identify a gating input to the function using the approach described in Section III-B. If that is successful, we are left with a five-variable function that is a factor of the original function (the gating variable is factored out). We then use the same procedure to search for a gating input to the five-variable function. If such an input can be identified, then the original function can be realized in the logic element. Note that the structure of cascaded gates in Fig. 8 is for illustration/clarity only: two cascaded AND gates can be implemented more efficiently in CMOS as a single larger AND gate, rather than multiple serially connected small gates. A two-input AND gate requires six transistors (a two-NAND followed by an inverter); a three-input AND gate requires eight transistors (a three-NAND followed by an inverter). Note that the CAD and architectural perspectives of the proposed logic elements can be separated from the circuit-level details: from the point of view of the technology mapping, knowing the two architecture parameters (the LUT size and the number of cascaded gates) along with the logic element style (AND or MUX) is sufficient to produce a legal mapping.
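The repeated factoring step for cascaded MUX gates can be illustrated on truth tables. The sketch below is a functional stand-in for the structural AIG check described above, not the mapper's implementation: it assumes the function depends on all of its listed variables, it searches for gating inputs exhaustively, and the names `find_gating`, `cofactor`, and `fits_extended` are hypothetical.

```python
from itertools import product


def find_gating(f, n):
    """Return (index, value) of some gating input of f, or None."""
    for i in range(n):
        for v in (0, 1):
            outs = {f(b) for b in product((0, 1), repeat=n) if b[i] == v}
            if len(outs) == 1:
                return i, v
    return None


def cofactor(f, i, v):
    """Fix input i of f to v, giving a function with one fewer variable."""
    return lambda bits: f(bits[:i] + (v,) + bits[i:])


def fits_extended(f, n, lut_size):
    """Check whether an n-variable f fits a lut_size-LUT followed by
    (n - lut_size) cascaded MUX gates, by factoring out one gating
    input per gate."""
    while n > lut_size:
        hit = find_gating(f, n)
        if hit is None:
            return False
        i, v = hit
        # Keep the residual function seen when the input is NOT gating.
        f = cofactor(f, i, 1 - v)
        n -= 1
    return True


if __name__ == "__main__":
    parity4 = lambda b: b[0] ^ b[1] ^ b[2] ^ b[3]
    f = lambda b: b[0] & (b[1] | parity4(b[2:]))
    print(fits_extended(f, 6, 4))
```

In the example, the first input forces logic-0 and the second forces logic-1, leaving a four-variable parity function for the inner 4-LUT.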
In cut-based FPGA technology mapping to K-LUTs, for a node with n nodes in its transitive fan-in cone, there are at most O(n^K) potential cuts [18]. For each such cut, performing the check for non-inverting paths requires, in the worst case, traversing the entire fan-in cone, which takes O(n) time. Hence the overall complexity for mapping is O(n^(K+1)), which is polynomial time as K is a fixed constant. In general, we did not observe any appreciable increase in mapper run-time for targeting the proposed architectures versus targeting a traditional LUT-based architecture. The placement and routing steps are, by far, the most compute-intensive phases of the FPGA CAD flow.
We map circuits into the proposed architectures using a modified version of the ABC technology mapper based on priority cuts [14]. Our modified mapper can operate in one of the following two ways.
1) Hard: When the technology mapper is generating the set of K-feasible cuts for a node, we ignore all cuts that cannot be accommodated in the architecture being targeted. The mapping solution produced is therefore guaranteed to contain only logic element instances that fit into the target logic element architecture.
2) Soft: We do not ignore any of the K-feasible cuts generated for any node. Rather, we change the way cuts are ranked by the priority cuts mapping algorithm. Specifically, we use the mapped depth as the primary criterion for ranking cuts, and as a secondary criterion, we prefer to choose cuts that legally fit into the target logic element architecture. The purpose of the "soft" flow is to evaluate, for a K-LUT-based logic element architecture, how many LUTs in the mapping solution need to be full K-LUTs if optimal mapping depth is to be achieved.
While the AIG circuit representation was the inspiration for our proposed architectures, its use is not required to identify gating inputs, nor required to target the proposed logic element architectures. Consider, for example, the standard sum-of-products (SOP) and product-of-sums (POS) representations of logic functions (as an alternative to AIGs). Any input in a function's SOP representation that is present in all of its product terms in one polarity (either true or complemented) is a gating input that can force the function to logic-0. Likewise, any input in a function's POS representation that is present in each of its sum clauses in one polarity is a gating input that can force the function to logic-1. Hence, while AIGs offer a convenient way to find gating inputs, through non-inverting paths, it is also easy to identify gating inputs for functions in SOP/POS form.
We believe it to be straightforward to modify any cut-based FPGA technology mapper to target the proposed extended LUTs, by filtering candidate cuts that do not meet the gating requirements.
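The SOP/POS observation above translates directly into a literal-intersection check. The following is a minimal sketch under an assumed encoding, where each product term or sum clause is a dict from variable name to polarity (`True` for the true literal, `False` for the complemented literal); the function names are hypothetical.

```python
def _common_literals(groups):
    """Literals (variable, polarity) present in every term or clause."""
    if not groups:
        return set()
    common = set(groups[0].items())
    for group in groups[1:]:
        common &= set(group.items())
    return common


def sop_gating_inputs(terms):
    """Map each gating variable to the value that forces the SOP to logic-0.

    A true literal in every product term is defeated by logic-0;
    a complemented literal in every product term is defeated by logic-1.
    """
    return {var: (0 if pol else 1) for var, pol in _common_literals(terms)}


def pos_gating_inputs(clauses):
    """Map each gating variable to the value that forces the POS to logic-1.

    A true literal in every sum clause is satisfied by logic-1;
    a complemented literal in every sum clause is satisfied by logic-0.
    """
    return {var: (1 if pol else 0) for var, pol in _common_literals(clauses)}


if __name__ == "__main__":
    # f = a*b + a*c'*d : input a appears in true form in every product term.
    print(sop_gating_inputs([{"a": True, "b": True},
                             {"a": True, "c": False, "d": True}]))
    # g = (s' + x)(s' + y) : s = 0 satisfies every clause.
    print(pos_gating_inputs([{"s": False, "x": True},
                             {"s": False, "y": True}]))
```

A cut-based mapper that stores cut functions in SOP or POS form could use such a check in place of the AIG traversal when filtering candidate cuts.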
IV. EXPERIMENTAL STUDY
A. Methodology
We use the mapper in [14] as the baseline mapper to which we compare. The baseline mapper was executed in depth mode, which achieves the minimum depth mapping and then performs area-driven post-passes based on the area-flow concept [19] .
The technology-independent transformations applied to circuits prior to technology mapping have considerable impact on mapping results. Multiple technology-independent transformation scripts are included with the ABC package. Prior to technology mapping, we applied the resyn2 script. We also investigated using the compress2 script, but found it produced slightly worse depth results, on average. For mapping into extended K-LUTs, we altered the area-driven post-passes in ABC to ensure that they produced mappings compliant with the logic element architectures targeted.
We borrow the approach of [5] and use two different sets of benchmark circuits in our experimental study: 1) the 20 combinational and sequential circuits commonly used in academic FPGA CAD and architecture research and 2) the 13 largest circuits from the widely used VPR 5.0 circuit set [20] . We used Altera's Quartus 9.1 tool to synthesize the VPR 5.0 circuits from Verilog to BLIF. Altera's QUIP (Quartus University Interface Program) flow [21] was used to produce BLIF for each circuit following HDL elaboration and technology independent synthesis. For all architectures and circuits considered, technology independent optimization (using resyn2) and technology mapping was executed 6 times and the best result achieved is reported. A similar multi-pass methodology was applied in [22] .
B. Results
We present two sets of results. We first present results for six-input logic element architectures. Such logic elements could be directly interchanged with the 6-LUTs in a modern commercial FPGA, such as the Xilinx Virtex-5. Changes to the interconnection fabric would not be required, as the fabric is already designed to handle the routing demand imposed by six-input logic elements. Subsequently, we present results for seven-input logic element architectures. Early commercial FPGAs (in the 1980s and 1990s) used 4-LUTs and the recent trend has been towards larger LUTs. In the future, we may well see commercial architectures with seven-input elements, and consequently, it is desirable to evaluate area/performance tradeoffs for seven-input logic elements. Table I gives results for mapping circuits into 6-LUTs (the baseline), 5-LUTs, and six different six-input logic element architectures:
-AND, -MUX, -AND, -MUX, -AND, -MUX. Recall that -AND and -MUX architectures contain an -LUT followed by a cascade of gates (AND or MUX). Hence, the architectures presented in the table use progressively less silicon area as the table is read from left to right. With the exception of the 5-LUTs, all architectures in the table require six inputs and hence, they could all be embedded into similar FPGA routing fabric that consumes a fixed amount of silicon area. For each architecture and circuit considered, both the depth of the mapped network (labeled "DEP") and the number of logic elements (labeled "#LEs") are given. The top half of the table shows results for the 20 circuits most commonly used in FPGA research; the bottom half of the table shows results for the VPR 5.0 circuits. We observed markedly different results for the two benchmark circuit sets and therefore decided to give geometric mean results for each circuit set, as well as for all of the circuits together (last rows of the table).
TABLE I. RESULTS FOR SIX-INPUT LOGIC ELEMENT ARCHITECTURES AND 5-LUTS
We first consider results for 5-LUTs versus 6-LUTs (see the "6-LUTs" and "5-LUTs" columns of Table I). For the 20 standard benchmarks, 5-LUT mapping solutions are 12% deeper and use 15% more LUTs than 6-LUT mapping solutions. For the VPR 5.0 circuits, 5-LUT mappings are 18% deeper and use 13% more LUTs than 6-LUT mappings. In terms of depth, the VPR 5.0 circuits are more sensitive to changes in the number of LUT inputs than the 20 standard circuits commonly used in FPGA research. In general, while more 5-LUTs are needed than 6-LUTs to implement a circuit, a 5-LUT requires just half the silicon area of a 6-LUT. On the other hand, 6-LUTs deliver a considerable depth reduction over 5-LUTs (14% across all circuits in both benchmark sets), which is the prime reason that commercial FPGA vendors have trended to 6-LUTs recently in their high-performance product lines.
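As an aside on how such averages can be formed, the following minimal sketch computes the percentage by which the geometric mean of one architecture's results exceeds that of a baseline. It assumes that the reported percentages are derived from per-set geometric means; the function names and the three sample depths are hypothetical placeholders and are not values taken from Table I.

```python
import math

def geomean(values):
    """Geometric mean of positive numbers, computed in the log domain."""
    if not values:
        raise ValueError("need at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))

def percent_increase(baseline, candidate):
    """Percent by which geomean(candidate) exceeds geomean(baseline)."""
    return 100.0 * (geomean(candidate) / geomean(baseline) - 1.0)

# Hypothetical mapped depths for three circuits (NOT data from Table I).
depth_6lut = [4, 8, 16]
depth_5lut = [5, 9, 18]
print(f"5-LUT mappings are {percent_increase(depth_6lut, depth_5lut):.1f}% deeper")
```

The geometric mean is a natural choice here because it weights relative changes equally across circuits of very different sizes, so a single large circuit does not dominate the average.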
Moving on to the proposed -AND and -MUX architectures, observe that across all circuits (bottom rows of Table I), relative to 6-LUTs, the -AND architecture increases depth by 6% and the -MUX architecture increases depth by 5%. As compared with 6-LUTs, the number of logic elements is increased by 9% and 7% for the -AND and -MUX architectures, respectively. Both of the proposed architectures require silicon area close to that of a 5-LUT, yet they both deliver most of the depth benefit of 6-LUTs. Moreover, fewer extended 5-LUTs than pure 5-LUTs are needed to implement circuits. On the depth and logic element count axes, the added flexibility offered by the MUX architecture provides slightly better results than the AND architecture.
It is worthwhile to examine the dependence of the results on the benchmark set. For the -MUX architecture, mapping depth is 3% higher than 6-LUTs, on average, for the standard 20 benchmarks. However, for the VPR 5.0 benchmarks, mapping depth is 9% higher than 6-LUTs, on average. The results demonstrate that the choice of benchmark set can have a significant impact on both architectural conclusions and the perceived efficacy of CAD algorithms. While it remains unclear which of the two benchmark sets is more representative of the universe of all circuits, the VPR 5.0 circuits appear to carry a higher "richness" in their logic functions and place a higher demand on the underlying logic element architecture.
The -AND and -MUX in Table I contain a 4-LUT with two cascaded gates. The gap in mapped depth between these two architectures is wider than in the case. Relative to 6-LUTs, the -AND architecture increases depth by 20% and logic element count by 21%, whereas the -MUX architecture increases depth by just 10% and logic element count by 19%. The MUX architectures can accommodate a wider range of logic functions. Fig. 12 illustrates a cut that can be implemented in a -MUX architecture yet cannot be implemented in a -AND architecture. In the example, input has a non-inverting path to the root. With factored out, the remaining function is:
, in which is a gating input whose logic-0 state causes to evaluate to logic-1. While is not a gating input to the overall function , it is indeed a gating input to function , revealed only after is selected as a first gating input to . The right-hand side of Fig. 12 shows how the signals map to pins of the logic element architecture. A last observation in relation to the architectures is that the -MUX architecture actually produces mapping solutions having smaller depth than 5-LUTs, despite the fact that the solution uses half of the area of a 5-LUT. The data in the right-most columns of Table I for the -AND and -MUX architectures are included for completeness. Such architectures increase depth by over 34% versus 6-LUTs, on average. We do not believe such a performance loss would be acceptable in a future commercial architecture.
Table II shows the results for 7-LUTs (baseline), 6-LUTs, and a variety of seven-input logic element architectures. Looking at the last rows of the "6-LUTs" columns, we see that 6-LUT mappings are 22% deeper than 7-LUT mappings and require 9% more LUTs, on average, across all circuits. For both circuit sets, the depth advantage of moving to 7-LUTs from 6-LUTs is larger than that observed for moving to 6-LUTs from 5-LUTs. We expect that some modern commercial designs may be highly pipelined and therefore more "shallow" than the circuits considered here. Highly pipelined circuits may exhibit less dependence on LUT size.
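The notion of a gating input used in the discussion of Fig. 12 can be illustrated with a small truth-table check. The sketch below is not the mapping algorithm of this paper. It only enumerates input assignments to find inputs that, at one logic value, force a function to a constant, and then repeats the check on a cofactor. The helper names and the example function are hypothetical (the function is not the one shown in Fig. 12), chosen so that a second gating input appears only after the first has been factored out.

```python
from itertools import product

def gating_inputs(f, n):
    """Return (index, value, forced_output) triples such that setting
    input `index` to `value` forces the n-input function f to a constant."""
    found = []
    for i in range(n):
        for v in (0, 1):
            outs = {f(*bits) for bits in product((0, 1), repeat=n) if bits[i] == v}
            if len(outs) == 1:
                found.append((i, v, outs.pop()))
    return found

def cofactor(f, i, v):
    """Fix input i of f to value v; the result takes the remaining inputs."""
    def g(*rest):
        bits = list(rest)
        bits.insert(i, v)
        return f(*bits)
    return g

# Hypothetical 6-input function: f = a AND (NOT b OR (c XOR d XOR e XOR h)).
def f(a, b, c, d, e, h):
    return a & ((1 - b) | (c ^ d ^ e ^ h))

print(gating_inputs(f, 6))   # only input a gates f (a = 0 forces f = 0)
g = cofactor(f, 0, 1)        # factor out a; g = NOT b OR (c XOR d XOR e XOR h)
print(gating_inputs(g, 5))   # b (index 0 of g) gates g: b = 0 forces g = 1
```

In this example, b is not a gating input of f, yet it becomes one for the cofactor g, with its logic-0 state forcing g to logic-1. This mirrors the situation described above, in which a second gating input is revealed only after the first is selected.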
The architectural trends observed in Table II are similar to those in Table I. As before, we observe that most of the depth benefit of moving from 6-LUTs to 7-LUTs can be achieved with the -AND and -MUX architectures, with the MUX architecture providing slightly better results. On average, across all circuits, the -MUX architecture offers mappings that are 8% deeper than 7-LUTs and use 7% more logic elements. This can be compared with 6-LUTs, which have roughly the same silicon area, yet whose mapping depth is 22% higher than 7-LUTs. As was the case in Table I, we observe a pronounced difference between the two circuit sets. In general, the standard 20 benchmarks are considerably less sensitive to the target element architecture.
The data in Table II suggest that if vendors added a MUX gate to their 6-LUT outputs and then mapped to such extended 6-LUTs, depth would be cut by about 14%, on average. In so doing, the 6-LUT-based blocks would need additional logic block inputs to provide a signal to the MUX input, possibly impacting routing demand. However, the Xilinx Virtex-5, for example, already has extra inputs on its logic blocks (e.g., the bypass inputs), which could perhaps be made dual-usage for driving the MUX gate.
Table III gives the approximate hardware cost of the key logic element architectures considered in this paper, accounting for the cost of the cascaded gates. For each architecture, we list the # of SRAM configuration cells (including cells in LUTs, cells to control optional input inversion, and cells to feed data inputs on multiplexers), the # of 2-to-1 multiplexers, and the number of inputs to the logic element. We have assumed that LUTs are implemented using a tree of 2-to-1 multiplexers, as shown in Fig. 1. The right-most columns of Table III give the ratio of the # of SRAM cells and # of multiplexers to baseline 6-LUTs. Such ratios represent the approximate hardware area cost of each architecture versus 6-LUTs. We stress that the ratios are approximate as we have not, for example, included any buffer costs and expect that large LUTs implemented as multiplexer trees contain repeaters at intermediate tree nodes. Likewise, we have not included transistor sizings, which we expect to be vendor and device specific. In general, the data in Table III support the observation that logic element area is dominated by LUT area, with the LUT dominance decreasing as gates are successively added in cascade to the LUT output. For example, the -MUX architecture is estimated to consume 52% of a 6-LUT's area.
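The style of accounting behind such cost ratios can be sketched as follows. A k-input LUT built as a tree of 2-to-1 multiplexers needs 2^k SRAM cells and 2^k - 1 multiplexers. The extra cells and multiplexers attributed to cascaded gates below are hypothetical placeholders, not the entries of Table III, and the function names are ours.

```python
def lut_cost(k):
    """(SRAM cells, 2-to-1 muxes) for a k-input LUT built as a mux tree."""
    return 2 ** k, 2 ** k - 1

def relative_cost(k, extra_cells=0, extra_muxes=0, baseline_k=6):
    """Cell and mux counts of a k-LUT plus extras, as ratios to a baseline LUT."""
    cells, muxes = lut_cost(k)
    base_cells, base_muxes = lut_cost(baseline_k)
    return (cells + extra_cells) / base_cells, (muxes + extra_muxes) / base_muxes

# A plain 5-LUT versus a 6-LUT: half the cells, slightly under half the muxes.
print(relative_cost(5))

# A 5-LUT with a cascaded gate; the overhead of 2 cells and 2 muxes is an
# assumed, illustrative figure only.
print(relative_cost(5, extra_cells=2, extra_muxes=2))
```

Because the LUT cost grows exponentially in k while each cascaded gate adds only a constant overhead, the ratio stays close to one half for an extended 5-LUT, which is in line with the observation that logic element area is dominated by LUT area.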
C. Die Area and Delay Impact
Using the data in Table III and the results in Table I, we can make a coarse estimate of the overall improvement in logic density. Using an extended 5-LUT, such as the -MUX architecture, instead of a 6-LUT will reduce the tile area needed for a logic element by roughly 50%. Smaller tiles will reduce wirelengths, interconnect capacitance and delay. As shown in Fig. 13(a), we estimate that in a CLB tile, such as that of the Xilinx Virtex-5 FPGA, the interconnection fabric (and its configuration circuitry and SRAM cells) consumes 50% of the tile layout area; the eight 6-LUTs in a Virtex-5 CLB (and their SRAM cells) consume 30% of the tile; and flip-flops and other circuitry comprise 20% of the tile. Fig. 13(b) gives an estimate of the tile area when the eight 6-LUTs are replaced with eight extended 5-LUTs. We assume LUT area is halved, and therefore total tile area is reduced by 15% and LUTs now comprise about 17.5% of the tile. This implies that if the original tile area were 1 unit, as in Fig. 13(a), the new tile area would be 0.85 units. Results in Table I demonstrate that 7% more extended 5-LUTs are needed versus 6-LUTs to implement circuits. Consequently, the silicon area needed to implement a given circuit will scale by 0.85 × 1.07 ≈ 0.91, which is roughly a 9% improvement in logic density versus 6-LUTs. In other words, a given logic circuit would require 9% less silicon area if the proposed architecture is used.
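The coarse estimate above can be reproduced with a few lines of arithmetic. In the sketch below, the inputs (a 30% LUT share of the tile, halved LUT area, and 7% more logic elements) are the figures quoted in the text; the function names are ours, and the model assumes that only the LUT portion of the tile shrinks.

```python
def tile_area_after(lut_share, lut_area_ratio):
    """Tile area (original tile = 1.0) when only the LUT portion is scaled."""
    return (1.0 - lut_share) + lut_share * lut_area_ratio

def relative_circuit_area(tile_area, element_count_ratio):
    """Silicon area for a circuit relative to the baseline architecture."""
    return tile_area * element_count_ratio

tile = tile_area_after(lut_share=0.30, lut_area_ratio=0.5)      # 0.85 units
area = relative_circuit_area(tile, element_count_ratio=1.07)    # about 0.91
lut_fraction = (0.30 * 0.5) / tile                               # LUT share of new tile

print(f"new tile area:         {tile:.3f}")
print(f"LUT share of new tile: {lut_fraction:.3f}")
print(f"relative circuit area: {area:.3f}")
```

Varying `element_count_ratio` shows how sensitive the density gain is to mapping quality: the area saving vanishes once roughly 18% more extended 5-LUTs are needed than 6-LUTs, since 1/0.85 is about 1.18.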
Assuming a square tile layout, the tile dimensions are reduced from 1 × 1 to 0.92 × 0.92, as shown in Fig. 13(b). Thus, the x-dimension and y-dimension have each been reduced by about 8%. Metal wire capacitance would be reduced accordingly, mitigating the higher logic depth associated with extended 5-LUTs. Recognize that only a fraction of interconnect capacitance is metal capacitance; the remaining fraction is switch capacitance (capacitive load due to routing switches attached to metal wire segments). Switch capacitance is unaffected, so we cannot assume that interconnect delay will be reduced by 8%. Nevertheless, the tile size reduction bodes well for the practicality of the proposed logic block.
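The following sketch makes the scaling argument concrete. It assumes a square tile, metal capacitance proportional to tile side length, and switch capacitance that does not change with tile size. The metal fractions tried in the loop are hypothetical, as the text does not quantify the split, and the function names are ours.

```python
import math

def side_scale(area_scale):
    """Scale factor of each tile dimension for a square tile."""
    return math.sqrt(area_scale)

def interconnect_cap_scale(area_scale, metal_fraction):
    """Relative interconnect capacitance when only the metal part shrinks.

    metal_fraction is the share of capacitance due to metal wire; the rest
    is switch capacitance, assumed unaffected by the tile size.
    """
    return metal_fraction * side_scale(area_scale) + (1.0 - metal_fraction)

print(f"side length scale: {side_scale(0.85):.3f}")
for metal_fraction in (0.25, 0.50, 0.75):   # hypothetical splits
    scale = interconnect_cap_scale(0.85, metal_fraction)
    print(f"metal fraction {metal_fraction:.2f}: capacitance scale {scale:.3f}")
```

Under these assumptions, the capacitance reduction lies between zero (all switch capacitance) and about 8% (all metal capacitance), which is why the 8% figure is an upper bound on the interconnect benefit.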
To further validate our results, we used VPR 5.0 [20] to pack, place and route circuits into logic blocks containing eight 6-LUTs and flip-flops. The cluster size of eight matches closely with Virtex-5 and Stratix-III FPGAs, whose logic blocks contain eight and ten 6-LUTs, respectively. A simple routing architecture with unidirectional length-4 wire segments was used. The circuits mapped into pure 6-LUTs were placed and routed and the minimum number of tracks per channel, , needed to route each circuit was determined. Then, both the baseline and experimental (extended 5-LUT) mapping solutions were packed, placed and routed into an architecture with tracks per channel. That is, routing architecture was held invariant between the baseline and experimental routing solutions. Each circuit was placed and routed three times with different placement seeds and the minimum critical path delay across the three runs was determined for each circuit. On average, critical path delay was 6% worse with the extended 5-LUTs, which concurs reasonably with the depth results given above. Note that 6% is a conservative upper bound on the performance hit, as it does not include the benefit of smaller tiles and reduced capacitance provided by the extended 5-LUTs.
D. Architectural Analysis
Finally, we did a preliminary architectural investigation of the value of heterogeneous logic blocks. We posed the question: If 6-LUT optimal depth must be achieved, how many of the LUTs need to have the full functionality of a 6-LUT versus how many can be implemented using extended 5-LUTs, i.e., the -MUX architecture? The results of this analysis are shown in Table IV. The left side of the table shows results for the standard 20 benchmark circuits; the right-hand side of the table gives results for the VPR 5.0 circuits. For each circuit, two percentages are given. The first percentage, in the "ABC mapping" column, shows the fraction of LUTs in mapping solutions produced by the baseline mapper [14] that require the full functionality of a 6-LUT (and could not be implemented using an extended 5-LUT). The second percentage, in the "Alternate mapping" column, gives results for the mapping approach described in Section III-E that prefers to use extended 5-LUTs, but does not impose hard restrictions and will not use extended 5-LUTs if mapping depth is compromised. These mapping solutions have the same optimal depth as the mapping solutions of the baseline 6-LUT mapper.
The results in Table IV show that even using the baseline mapper, only 12% of LUTs need the full functionality of a 6-LUT to achieve optimal depth. Note that for this work, we used a more recent version of ABC than was used in [6] . The mapper in the new version of ABC incorporates the WireMap algorithm described in [23] , which tends to produce fewer LUTs that use all six inputs. With the alternative mapping, we observe that only 1.6% of LUTs need to be full 6-LUTs to achieve optimal depth. The data in Table I revealed that in most cases, optimal depth can be achieved without any pure 6-LUTs. Yet, observe that no circuit has a value of 0 in the "Alternative mapping" column of Table IV . This is due to our cost function that only prefers to use extended 5-LUTs, and is therefore heuristic.
In summary, we suggest that a heterogeneous architecture with a fraction of pure 6-LUTs and a fraction of extended 5-LUTs may be viable. Very few pure 6-LUTs are needed in the architecture, perhaps 5% at most.
V. CONCLUSION AND FUTURE WORK
We proposed a family of FPGA logic element architectures inspired by the AIG network representation used in modern logic synthesis research. The logic element is an extended LUT, which contains a -LUT along with cascaded AND or MUX gates on its output. Results show that a -MUX extended LUT provides performance close to a 6-LUT, yet has silicon area close to that of a 5-LUT. We believe our work should keenly interest commercial vendors whose logic blocks are based on 6-LUTs. Higher logic density can be achieved by exchanging some or all of the 6-LUTs with extended 5-LUTs, with little negative impact on circuit delay.
It is worth recalling an early work published in 1992 by Chung and Rose that considered mapping circuits into multiple LUTs that were hard-wired together in specific configurations [24] . One sample architecture considered in that work was two cascaded 4-LUTs-the output of one LUT hard-wired to an input of a second LUT. The observation that modern FPGAs do not incorporate such hard-wired LUTs is perhaps reflective of the difficulty in mapping to such architectures. In our work, the logic element architecture is driven by the netlist representation which greatly simplifies mapping.
Finally, in this work, mapping was performed directly on netlists produced by technology-independent transformation scripts. Future work will involve exploration of technology-independent transformations to encourage creation of netlist topologies that can be accommodated by the extended LUT element.
