Abstract-This paper explores the effect of logic block architecture on the speed of a field-programmable gate array (FPGA). Four classes of logic block architecture are investigated: NAND gates, multiplexer configurations, lookup tables, and wide-input AND-OR gates. An experimental approach is taken, in which each of a set of benchmark logic circuits is synthesized into FPGA's that use different logic blocks. The speed of the resulting FPGA implementations using each logic block is measured. While the results depend on the delay of the programmable routing, experiments indicate that five- and six-input lookup tables and certain multiplexer configurations produce the lowest total delay over realistic values of routing delay. The primary reason is that these blocks can implement typical logic using the fewest levels of logic blocks, and thus incur a small number of stages of the slow programmable routing present in all FPGA's. The secondary reason is that their inherent combinational delay is not excessive. The fine-grain blocks, such as the two-input NAND gate, exhibit poor performance because these gates require many levels of logic block to implement the circuits and hence incur a large routing delay.
I. INTRODUCTION
The field-programmable gate array (FPGA) is a new ASIC medium that provides instant manufacturing turnaround and extremely low prototype manufacturing costs. An FPGA can be designed like a mask-programmed gate array (MPGA) but is user-programmable like a programmable logic device (PLD). The user-programmability, however, causes an FPGA to have both lower logic density and lower performance than an MPGA that is made in the same process technology. These deficiencies can be addressed by improving the architecture of the FPGA, which consists of its logic block function, interconnection structure, and I/O block design. In previous work, we have investigated the effect of logic block functionality on the area-efficiency of FPGA's [22], and the effect of switching block flexibility on the routability of an FPGA [23]. In this paper we look at the effect of logic block architecture on FPGA performance.
The FPGA was introduced in [8]. Since then newer ver
Manuscript received August 8, 1991; revised November 4, 1991. This work was supported by NSERC under Operating Grants URF0043298, A4029, and OGP0036648, a MICRONET research grant, a research grant from Bell-Northern Research, and ITRC.
S. Singh
There are many different kinds of interconnection structures, such as those articulated in these commercial architectures and in [23]. It is universally true, however, that the delay of the routing is significantly greater than that of a simple metal wire in the same process technology because programmable interconnects contain significant resistance and capacitance. For example, in [15] connections are made with pass transistors with 1- to 2-kΩ resistance, and in [1] connections are made with 300- to 500-Ω antifuses. As a result, connection delays often exceed the delay of the logic block, and this is one of the fundamental limitations on FPGA speed.
The performance of an FPGA can be increased by reducing the number of stages of programmable routing used in the critical paths. One way to do this is to use logic blocks with high functionality so that the number of logic block levels in the critical path is minimized, as illustrated in Fig. 1. Fig. 1(a) gives the implementation of the logic function f = ab2 + abc + ac2 using a two-input NAND gate as the logic block. It requires four levels of the logic block in the critical path. Fig. 1(b) shows an implementation of the same function using three-input lookup tables, which requires only two levels. Since the latter avoids two levels of slow programmable interconnect, this will likely lead to a significant decrease in delay. Increasing the functionality of the logic block, however, is likely to increase its combinational delay. This increase is only profitable if the reduction in routing delay more than offsets the increase in total delay due to the logic block.
In this paper, an empirical approach is taken to study the effect of the logic block functionality on the total delay of an FPGA. We seek a logic block that minimizes the total delay in an FPGA. The experiments presented in this paper indicate that five- and six-input lookup tables exhibit the lowest total delay over a set of logic circuits for the important values of routing delay, and the multiplexer-based block used in [11] is close behind. The NAND gates, on the other hand, give the largest total delay, while the delay of FPGA's based on wide-input AND-OR gates is between these two. There is one major caveat in these experiments: the answers depend heavily on the quality of the logic synthesis tools used to generate them. In all cases the best tools available were used; this does not preclude the possibility that a better tool for a given logic block would improve that block's result.
This paper is concerned only with the speed of the FPGA and so we ignore issues that affect the logic density. While area has an indirect effect on speed, the assumption is that CAD tools can be optimized to reduce that effect. Previous work [22] has addressed the issue of density. An earlier version of this research appeared in [24], and an extended version appears in [25]. A similar study, which focuses on lookup tables and performs in-depth implementations to the full place-and-route level, appears in [16].
This paper is organized as follows. Section II describes the selection of logic blocks investigated, the experimental procedure, and the delay model. Section III presents the experimental results, while Section IV outlines the conclusions and the relevant future work.
II. EXPERIMENTAL CHOICES, PROCESS, AND MODEL
To compare the different logic blocks for their effect on the speed of an FPGA, our approach is to synthesize a set of circuits into many FPGA's. Each circuit is synthesized into a number of different FPGA's, where each FPGA uses a different basic logic block. The delay of the resulting implementation is then measured. The results, summarized by logic block, give an indication of the effect of logic block choice on the speed of an FPGA.
By "synthesize" we mean that a circuit passes through the logic synthesis necessary to transform a logic description of the circuit into an optimized network consisting of connections between one kind of logic block. By "measuring" the delay of the synthesized network, we mean determining the critical path length in each FPGA implementation and then estimating the total delay using a model.
The following section discusses the selection of logic blocks used in these experiments. Subsequent sections relate the synthesis procedure and the delay modeling.
A. Logic Block Selection
To represent a wide cross section of the possible blocks, four classes of logic block were selected for comparison: NAND gates, multiplexers, lookup tables, and wide-input AND-OR gates. Table I gives the name and description of the logic blocks chosen from each class. The four classes are described below.
1) NAND Gates: The nand2, nand3, and nand4 gates are two-, three-, and four-input NAND gates, respectively. The nand2pi, nand3pi, and nand4pi are the corresponding NAND gates that have a programmable inversion capability, which allows the inputs to be passed in true or complement form. These were chosen because several FPGA's have been proposed which use NAND or AND gates [9], [19], [20], and this is a similar level of granularity to that used in MPGA's.
The NAND gates were implemented using standard CMOS techniques [25] . The programmable inversion was performed with an inverter and a pass gate [25] .
2) Multiplexers: The mux21 and mux41 logic blocks can implement all possible logic functions of a multiplexer by connecting the primary inputs and the selector inputs to either constants (0 or 1) or signals. The muxA logic block is the one used in [11], which
4) AND-OR Gates: These gates perform a two-level AND-OR logic function. We use the notation AxOy pi to describe each gate, where x is the total number of inputs that can be selected to form y separate product terms. The y product terms are ORed together in the logic block to generate the output. For example, A8O3pi has a total of eight inputs, each of which can be selected to form three separate product terms that are ORed together. These gates have the programmable inversion capability.
Table I also gives the worst-case delay of each logic block determined using the SPICE 2G6 circuit simulator [27], in a 1.2-μm CMOS process. The simulation includes a small buffer following the logic function, but no loading or delay due to routing.
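As an illustration of the AxOy pi notation, the following Python sketch evaluates such a block at the behavioural level. The configuration encoding (a list of product terms, each a list of (input index, inverted) pairs) and the function name `and_or_pi` are assumptions made for this example; they do not describe the circuit-level implementation whose delay is reported in Table I.

```python
def and_or_pi(inputs, terms):
    """Evaluate a two-level AND-OR block with programmable inversion.

    inputs: sequence of booleans, one per block input (x of them).
    terms:  list of up to y product terms; each term is a list of
            (input_index, inverted) pairs naming the literals it ANDs.
    The encoding is hypothetical and chosen only for illustration.
    """
    def literal(index, inverted):
        # Programmable inversion: pass the input in true or complement form.
        return inputs[index] != inverted

    # OR together the AND of the selected literals of each product term.
    return any(all(literal(i, inv) for i, inv in term) for term in terms)


# Example configuration with four inputs and three product terms:
# f = a.b + (not c).d + a.(not d)
a, b, c, d = 0, 1, 2, 3
config = [
    [(a, False), (b, False)],
    [(c, True), (d, False)],
    [(a, False), (d, True)],
]

print(and_or_pi([True, True, False, False], config))  # the a.b term is true
print(and_or_pi([False, False, True, True], config))  # no product term is true
```

Unused product terms should simply be omitted from `terms` in this encoding, since an empty term would evaluate as always true.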
B. Logic Synthesis Procedure
Logic synthesis is required to convert each test circuit into a network of logic blocks, while minimizing the number of logic stages between the primary inputs and the output of the circuit. The procedure employed is described below. Note that this procedure deals only with combinational circuits, as we assume that the sequential and combinational portions of the circuits have been separated.
1) Collapse the logic circuit into a two-level representation, using the MIS II [7] collapse command so that each output is only a function of its primary inputs. Optimize the two-level expression using Espresso [6].
2) Factor each output separately into a multilevel logic expression using the MIS II decompose command [7]. This may result in logic that is used in more than one expression, and so we say that it creates fan-out. We then remove this fan-out by replicating the logic expression that is the source of the fan-out. This step is necessary because most technology mapping approaches, including some of those used in step 3, do a poor job across fan-out. By removing fan-out the delay is reduced but the area is increased, and as such the results presented below are optimistic. It is possible that better CAD tools would be able to operate on networks with fan-out and achieve similar results. Note also that without fan-out some circuits, such as a parity tree, have a size that is exponential in the number of inputs. It was thus not possible to run this class of circuits in these experiments.
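The cost of removing fan-out by replication can be sketched with a small Python example. The network representation (a dictionary mapping each gate to its fan-in) and the NAND-based XOR chain used as a parity-like circuit are assumptions for illustration only; they are not the benchmark circuits or the MIS II data structures.

```python
def dag_size(root, fanin):
    """Count distinct gates reachable from root (shared logic counted once)."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in fanin and node not in seen:
            seen.add(node)
            stack.extend(fanin[node])
    return len(seen)


def tree_size(root, fanin, memo=None):
    """Count gates once every fan-out point has been removed by replication."""
    if memo is None:
        memo = {}
    if root not in fanin:  # primary input: contributes no gate
        return 0
    if root not in memo:
        memo[root] = 1 + sum(tree_size(c, fanin, memo) for c in fanin[root])
    return memo[root]


def xor_chain(n_stages):
    """Build a chain of XORs, each made of four two-input NAND gates."""
    fanin, prev = {}, "x0"
    for k in range(1, n_stages + 1):
        a, b = prev, f"x{k}"
        n1, n2, n3, out = (f"s{k}_{t}" for t in ("n1", "n2", "n3", "out"))
        fanin[n1] = [a, b]    # n1 fans out to both n2 and n3
        fanin[n2] = [a, n1]
        fanin[n3] = [b, n1]
        fanin[out] = [n2, n3]
        prev = out
    return fanin, prev


for stages in range(1, 5):
    net, out = xor_chain(stages)
    print(stages, dag_size(out, net), tree_size(out, net))
```

With fan-out the gate count of this chain grows linearly with the number of stages, while the replicated (fan-out-free) version roughly triples per stage, which is the kind of growth that makes parity-like circuits impractical under this procedure.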
3) Perform the technology mapping of the Boolean network. This converts the Boolean logic expressions into a network of logic blocks. The best available technology mapping tool was used for each class of logic block, as described below.
a) NAND gates and multiplexers: For these logic blocks the technology mapping is done using the MIS 2.2 technology mapping program, which is the most recent version of the one presented in [10]. The mapper is set to optimize the critical path delay. It requires a library to be generated for each logic block that describes all possible logic functions that the block can perform. In all cases, a complete library was constructed (
C. Model for Measuring Delay
The speed of a circuit implemented in an FPGA with a given logic block is a function of the combinational delay of the logic block (DLB), the number of logic blocks on the critical path (NL), and the delay incurred in the routing between each logic block (DR). Assuming that each stage of logic block incurs one routing delay and one logic block delay, the total delay (DTOT) can be calculated as

DTOT = NL × (DLB + DR). (1)

The value of NL can be measured for each circuit after it is mapped into a logic block using the procedure described above. The value of DLB was determined as described in Section II-A.
The value of DR is much more difficult to determine. It is a function of the routing architecture, the fan-out of a connection (which would be determined by the physical placement), the length of the connection, the process technology, and the programming technology. Since our purpose is to understand general architectural principles, it is important not to fix any of these parameters. As such, most of the results below will be given as a function of DR, rather than choosing a specific DR.
This approach, however, makes the approximation that DR is constant for each connection. This is a simplifying abstraction that makes this broad set of experiments possible, but it is inaccurate, and must be considered when any conclusions are drawn. It is comforting to note that in [16], where complete implementations down to the place-and-route level were performed, results where the experiments overlap are similar to those presented here.
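One consequence of this model is a break-even routing delay between any two logic blocks. The short derivation below is a sketch under the model's own assumptions; the subscripts a (finer-grain block) and b (coarser-grain block, with fewer levels) are labels introduced here for illustration.

```latex
% Block b is faster than block a when its total delay is smaller.
% Assume N_{L,a} > N_{L,b}, i.e. block b needs fewer levels.
\begin{align*}
N_{L,b}\,(D_{LB,b} + D_R) &< N_{L,a}\,(D_{LB,a} + D_R) \\
D_R\,(N_{L,a} - N_{L,b}) &> N_{L,b}\,D_{LB,b} - N_{L,a}\,D_{LB,a} \\
D_R &> \frac{N_{L,b}\,D_{LB,b} - N_{L,a}\,D_{LB,a}}{N_{L,a} - N_{L,b}}
\end{align*}
```

If the right-hand side is negative, block b is faster for every nonnegative routing delay, including mask-programmed routing (DR = 0). Otherwise the higher-functionality block only pays off once the routing delay exceeds this threshold.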

III. EXPERIMENTAL RESULTS
The experimental circuits that were used are a selection of 15 logic synthesis benchmarks provided by the Microelectronics Center of North Carolina (MCNC) and one standard cell-based circuit from Bell-Northern Research. They range in size from 28 to over 700 two-input NAND gate equivalents. Each circuit was passed through the implementation procedure described in Section II-B once for every logic block listed in Table I. Sections III-A through -D discuss the relative performance of the logic blocks in each of the four classes (NAND gates, multiplexers, lookup tables, and AND-OR gates). Section III-E compares the best logic blocks from the four classes.
The comparison of different logic blocks is done by averaging the critical path delays over all the test circuits.
A. NAND Gates
Table II gives the summarized delay data for NAND gates. The first column names the gate, the second column lists the combinational delay from Table I
Table II shows that the total delay for the nand3 block, for all values of DR, is less than that for nand2. This occurs because the increase in functionality from nand2 to nand3 lowers the average number of logic blocks in the critical path (N̄L) from 15.2 to 11.8. While the delay of nand3 (0.88 ns) is slightly more than that of nand2 (0.70 ns), the saving in the number of levels more than compensates. Interestingly, this is true even for DR = 0, which would correspond to mask-programmed routing. As the routing becomes slower (DR > 0), the relative performance of nand3 improves over nand2 because each routing stage costs more in delay.
The reduction in total delay from nand3 to nand4 is not as significant as it is from nand2 to nand3. This is because the increase in logic block delay is not offset by a reduction in the total number of logic block levels. Note that no further improvement was achieved with the nand5 gate or any larger NAND gates. Table II also indicates that the addition of programmable inversion (the gates suffixed with "pi") to the NAND gate inputs causes a significant reduction in the number of logic block levels. The programmable inversion, however, requires roughly 0.5 ns of extra combinational delay, which makes such gates slower at DR = 0 than the pure NAND gates. As DR increases, nand2pi, nand3pi, and nand4pi give increasingly better delay than nand2, nand3, and nand4, respectively. This is because the increased combinational delay is more than offset by the reduction in the routing delay due to a lower N̄L. For DR > 0, nand3pi and nand4pi give the best performance among NAND gates. While nand3pi and nand4pi exhibit almost the same performance, nand3pi has fewer inputs and so it would be the best choice among the NAND gates. This is because the number of inputs has an indirect effect on delay, which is not considered in the delay model: as the number of inputs to a block increases, there are more connections to the block, and hence more parasitic capacitance to be driven.
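The nand2 versus nand3 comparison can be reproduced from the figures quoted in the text. The Python sketch below applies the delay model of Section II-C to the average level counts (15.2 and 11.8) and the combinational delays (0.70 ns and 0.88 ns); the function and variable names are illustrative, and the printed values are computed here rather than copied from Table II.

```python
def total_delay(n_levels, d_lb, d_r):
    """Total delay in ns: levels times (logic block delay + routing delay)."""
    return n_levels * (d_lb + d_r)


# (average logic block levels, combinational delay in ns), as quoted in the text
BLOCKS = {"nand2": (15.2, 0.70), "nand3": (11.8, 0.88)}

for d_r in (0.0, 2.0, 4.0, 10.0):
    delays = {name: total_delay(n, d, d_r) for name, (n, d) in BLOCKS.items()}
    summary = ", ".join(f"{name} = {t:6.2f} ns" for name, t in delays.items())
    print(f"DR = {d_r:4.1f} ns: {summary}")
```

Because the model is linear in the number of levels, applying it to the averaged level count gives the same result as averaging per-circuit delays. The gap between the two gates widens as DR grows, matching the observation that each extra routing stage costs more when routing is slow.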
B. Multiplexers
The total delay results for three multiplexer configurations are given in Table III, which has the same columns as Table II. The muxA logic block exhibits the lowest N̄L. This is due to the high number of logic functions that this logic block can perform, several of which have appreciable fan-in. These wider gates are capable of reducing logic depth because depth is roughly logarithmic in the number of inputs, with the base of the logarithm equal to the fan-in of the gate. The combinational delay of the muxA logic block is the same as that of a four-to-one multiplexer (mux41), and because it has lower N̄L, it gives better performance for all values of DR. Thus, the muxA block would be the best choice among the multiplexer configurations investigated.
C. Lookup Tables
Table IV summarizes the total delay results for K-input lookup tables, with K ranging from 2 to 9. As the number of inputs to the lookup table increases, we observe that the number of logic block levels in the critical path continues to decrease up to K = 9. Significant decreases are obtained as far as K = 8. This occurs because lookup tables can implement any function of K inputs, and so they can contain many levels of logic. Notice that the lookup table inherently has the programmable inversion capability.
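The logarithmic depth argument made above for wide gates, which also underlies the falling level counts for larger K, can be sketched as a lower bound. The Python function below computes the minimum number of levels a tree of blocks with a given fan-in needs to combine n inputs; the function name is an assumption, and real mapped circuits will generally need more levels than this bound.

```python
def min_levels(n_inputs, fan_in):
    """Smallest number of levels L such that fan_in ** L >= n_inputs."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    levels, reach = 0, 1
    while reach < n_inputs:
        reach *= fan_in  # one more level multiplies the reachable inputs
        levels += 1
    return levels


for fan_in in (2, 3, 4, 5, 6):
    print(fan_in, min_levels(16, fan_in))
```

Integer multiplication is used instead of a floating-point logarithm so that exact powers (for example 16 inputs with fan-in 4) are not affected by rounding.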
As the number of inputs to the lookup table increases, the logic block delay (DLB) increases roughly 0.4 ns for every added input, after K = 3. This is because each added input causes one more transistor to be added in series in the multiplexer tree that implements the lookup table.
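The multiplexer-tree view of a lookup table can be sketched behaviourally in Python. This is a functional model only, with assumed names (`lut_eval`, `lut_bits`) and an assumed bit ordering in which the first input is the least significant address bit; it does not model the transistor-level circuit or its delay.

```python
def lut_bits(k):
    """Number of configuration bits in a K-input lookup table."""
    return 2 ** k


def lut_eval(config_bits, inputs):
    """Evaluate a lookup table as a tree of two-to-one multiplexers.

    config_bits: 2**K stored bits, indexed by the input address.
    inputs:      K booleans; inputs[0] is the least significant address bit.
    """
    if len(config_bits) != lut_bits(len(inputs)):
        raise ValueError("need 2**K configuration bits for K inputs")
    level = list(config_bits)
    for select in inputs:
        # Each input drives one level of 2:1 multiplexers, halving the
        # number of candidate bits; K inputs give K levels in series.
        level = [level[i + 1] if select else level[i]
                 for i in range(0, len(level), 2)]
    return level[0]


# Example: a three-input majority function stored in an 8-bit table.
majority = [bin(address).count("1") >= 2 for address in range(8)]
print(lut_eval(majority, [True, True, False]))   # two of three inputs high
print(lut_eval(majority, [True, False, False]))  # only one input high
print([lut_bits(k) for k in range(2, 10)])       # storage doubles per input
```

The loop makes the two costs of a larger K visible: one extra multiplexer level in the signal path per added input, and a doubling of the stored bits.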
For very fast routing (DR = 0), the fastest logic block is strictly a function of the number of logic block levels (N̄L) and the delay of the logic block (DLB). As shown in Table IV, for values of K greater than 2, the total delay (DTOT) is almost constant. This says that the reduction in delay due to a lower N̄L as K increases above 3 is almost exactly offset by the increase in delay due to the increase in DLB.
As DR increases, the cost in delay of each logic block level increases, and so the blocks with lower values of N̄L achieve superior performance. For DR = 2 ns the six-input lookup table achieves the best performance. For DR = 4 ns the seven-input lookup table achieves the best performance. Notice, however, that the five-input lookup table achieves similar performance in both cases, and the accuracy of these experiments makes this small spread insignificant. In addition, as noted in Section III-A, the delay model does not account for the increase in delay due to extra capacitive loading from the higher number of pins, and so the marginal improvement shown in Table IV from K = 5 to K = 7 may be lost. The actual choice of logic block might be more strongly influenced by the fact that each added input doubles the number of bits in the lookup table, and hence the area. Thus, the five- and six-input lookup tables are good choices for DR = 2 ns and DR = 4 ns, which are realistic values for routing delay.
As DR increases to 10 ns, the best value of K continues to increase, to K = 8. It is clear that as long as increasing K results in a decrease in N̄L, higher values of K will be faster for higher values of DR.
D. AND-OR Gates
Table V gives the total delay for the wide AND-OR gates. It is clear that for all values of the routing delay, the AND-OR blocks with five product terms (AxO5pi) exhibit lower total delay than the corresponding blocks with three product terms (AxO3pi). There is an average improvement in delay of 10 to 15% from three to five product terms. This occurs because the blocks with five product terms have smaller N̄L than those with three product terms, while the increase in DLB from three to five product terms is minor.
For fast routing, with DR up to 2 ns, the A4O5pi block gives the lowest delay as this logic block presents a good balance between combinational delay and functionality: it has 26% fewer average logic block levels than A2O5pi, and only 15% more combinational delay. Blocks with more than four inputs incur a combinational delay that more than offsets the gain due to a reduction in NL, at the lower values of routing delay. For DR = 4 ns, the A8O5pi block exhibits the lowest delay, while for DR = 10 ns, the A16O5pi block is the fastest. As the routing delay increases, the effect of N̄L dominates the total delay since the routing delay per stage is much greater than the combinational delay per stage. However, as discussed in Section III-A, the delay model in this study gives an advantage to logic blocks with a large number of inputs, and so marginal decreases in delay resulting from a larger number of inputs should be ignored. Thus, the A4O5pi and A8O5pi blocks provide the best performance over the realistic values of routing delay. These two blocks will be compared against the best from other categories in the next subsection.
E. Overall Comparison
Fig. 2 is a plot of the total delay of the best logic blocks from each class versus the routing delay DR. Table VI tabulates the same data. The first clear conclusion from these data is that the fine-grain logic blocks, such as the two-input and three-input NAND gates (even with programmable inversion), exhibit markedly lower performance than any other class of logic block. This is a significant conclusion, given that several commercial FPGA's use the two-input NAND gate as the basic logic block. Notice that the result is true even for a routing delay of zero, which provides an interesting perspective on mask-programmed architectures: they should perhaps use a more coarse-grain basic block, as suggested in [12].
At zero routing delay, the muxA logic block is the fastest because it has a very small combinational delay combined with a low number of logic block levels.
For the midrange routing delays (2 ns ≤ DR ≤ 4 ns) the five- and six-input lookup tables and the muxA logic block exhibit similar delays, with the lookup tables slightly faster. At this point the routing delay is mostly greater than the logic block delay, so the number of logic block levels begins to dominate in the comparison. These blocks have quite low values of N̄L. The wide AND-OR gates, which have N̄L close to that of the muxA block, exhibit worse performance because of a significantly higher combinational delay.
For large delays (DR = 10 ns) the five- and six-input lookup tables are significantly faster. This is because in
however, that the conservative approximation described in Section II-B unfairly disadvantages these blocks.
F. Limitations of Results
It should be noted that these results depend heavily on the quality of the logic synthesis tools. We have observed shifts in these results by moving from technology mappers that optimize for area to those that optimize for delay. In these experiments we have used the best mapping tools available to us.
Another limitation is the approximation of DR as a constant, as discussed in Section II-C. While this value will certainly vary, even within one connection, one would expect it to change in similar ways for each block. The conclusions presented above would likely not change in a significant way if the variation is taken into account, as shown by the fact that the results in [16] for lookup tables are similar to those presented here.
IV. CONCLUSIONS AND FUTURE WORK
This paper has explored the relationship between logic block architecture and the speed of the resulting FPGA. There are two principal conclusions:
1) five- and six-input lookup tables and the muxA [11] logic block are all good choices for a logic block for midrange values of programmable routing delay;
2) fine-grain logic blocks, such as two-input NAND gates, result in significantly more delay (by a factor of more than 3) than these blocks.
In addition, wide AND-OR gates do not achieve comparable performance to the best blocks, but it is possible that better logic synthesis for these blocks would improve their performance.
In the future, we will adapt these experiments and tools to account for the area cost of the gain in speed. In addition, we will explore the performance gains possible when hard-wired (fast) links between basic logic blocks are used.
