We examine empirically the performance of multi-level logic minimization tools for a lookup table-based Field-Programmable Gate Array (FPGA) technology. The experiments are conducted using the university tools misII for combinational logic minimization and mustang for state assignment, and the industrial tools xnfmap for technology mapping and apr for automatic placement and routing. We measure the quality of the multi-level logic minimization tools by the number of routed configurable logic blocks (CLBs) in the FPGA realization. We report three results: a) there is a linear relationship between the number of literals and the number of routed CLBs; b) in all 34 MCNC-89 benchmark finite state machines, one-hot state assignment resulted in substantially fewer CLBs than any other state encoding method available in mustang; and c) we present a delay model that predicts routing delay based on fanout, and apply the model to estimate the delays of the FPGA implementation of logic expressions prior to technology mapping, placement and routing. These results are useful for prototyping a design in FPGAs and then transferring the design to a different technology (e.g., CMOS standard cell). They provide valuable information on the difference in performance of a design realized in different technologies.
Introduction
The advent of FPGA technology provides a mechanism for rapid prototyping. When a prototype is operational, the design may be transferred to a different technology (such as custom or semi-custom VLSI) for mass production. It is valuable to be able to predict differences in performance of a design across different technologies.
One way to achieve this is to use the same set of design tools at higher levels in the design flow, such as multi-level logic minimization tools for technology-independent minimization, followed by quick technology mappings to different technologies. We pose the following question: is there an intermediate form that serves as the basis for estimating the performance of a design across different technologies? The performance of multi-level logic minimization tools for CMOS standard cell implementation is relatively well known and studied, but little is known about their performance with respect to FPGAs. To answer this question, we examine empirically the performance of multi-level logic minimization tools for a lookup table-based FPGA realization [1]. The experiments are conducted using misII2.0 for combinational logic minimization and mustang for state assignment. The vendor-supplied program xnfmap is used for technology mapping, and apr is used for automatic placement and routing. We measure the quality of the multi-level logic minimization tools in relation to the FPGA technology by the number of routed configurable logic blocks (CLBs) and the speed of the realized prototypes. We present the following results:
1. There is a linear relationship between the number of literals and the number of routed CLBs.
2. In all 34 MCNC-89 benchmark finite state machines, one-hot state assignment resulted in substantially fewer CLBs than any other state encoding method available in mustang.
3. We present a delay model that predicts routing delay based on fanout, and apply the model to estimate the delays of the FPGA implementation of logic expressions prior to technology mapping, placement and routing.
A lookup table-based FPGA architecture
Xilinx FPGAs are dense arrays of gates that can be configured, and reconfigured, by the system designer through software, rather than by the chip manufacturer on the fabrication line. With realization times measured in hours, systems incorporating up to thousands of gates on a single FPGA can be designed, programmed and evaluated within a few weeks [1]. The basic building block that provides the logic functionality in the XC3000 series FPGA architecture is shown in Fig. 2. This is a Configurable Logic Block (CLB), which has a maximum of 5 logic inputs. Each CLB has a programmable combinational logic section and two flip-flops. The programmable combinational logic section can implement any 5-variable logic function, or two functions of at most 4 variables each as long as they have at most 5 variables altogether. Each CLB also has two outputs, called x and y, which drive the programmable interconnect networks (not shown). The outputs of the combinational logic section can go directly to x and y or through flip-flops FF1 and FF2.
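The CLB capacity rule stated above can be written as a small feasibility check. The following is only an illustrative sketch: the function name `fits_single_clb` and the representation of a function by the set of its input variable names are assumptions of this example, and the check ignores flip-flop usage and every other architectural detail.

```python
def fits_single_clb(support_f, support_g=None, max_inputs=5, max_pair_inputs=4):
    """Check the XC3000 CLB capacity rule on variable supports only.

    support_f, support_g: iterables of input variable names (hypothetical
    representation chosen for this sketch).
    """
    f = set(support_f)
    if support_g is None:
        # A single function may use all five logic inputs of the CLB.
        return len(f) <= max_inputs
    g = set(support_g)
    # Two functions: each has at most 4 variables, and together they
    # use at most 5 distinct variables.
    return (len(f) <= max_pair_inputs
            and len(g) <= max_pair_inputs
            and len(f | g) <= max_inputs)


# Two 4-input functions sharing three inputs can be paired in one CLB.
print(fits_single_clb("abcd", "bcde"))
# Two 4-input functions with disjoint inputs cannot.
print(fits_single_clb("abcd", "efgh"))
```

A mapper that pairs small subexpressions is, in effect, searching for pairs that pass a test of this kind.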
3 A system for rapid prototyping using FPGAs
As depicted in Fig. 1, our design environment is based on wireC, which uses xdp as the front end for schematic entry [2]. We have configured wireC to handle eqn format files generated by misII.
Fig. 3 shows an empirical relationship between the number of literals and the number of (routed) CLBs that we obtained. It shows that the ratio of literals to CLBs is roughly 5:1. Some state assignment strategies tend to generate designs that are not routable; this will be elaborated in Section 5.
The second suite of circuits comes from the MCNC-89 combinational logic benchmarks. Only those circuits that can be implemented with the XC3000 series FPGAs are included. The circuits are mapped using three different lookup table-based technology mappers: Chortle-crf [9], xnfmap and rmap [10]. Fig. 4 shows an empirical relationship between the number of literals and the number of (routed) CLBs. Again, the ratio of literals to CLBs is roughly 5:1, with no essential difference among the technology mappers. The only exception is the C499 ECC benchmark, which has a large number of XOR gates.
This empirical result can be applied to guide the partitioning of a large design into multiple FPGAs. It can also be used to estimate whether a design can be accommodated in an FPGA, simply by counting the literals.
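As a rough illustration of how this rule of thumb might be applied, the sketch below divides a literal count by 5 and compares the result with a device capacity. The function names, the rounding up, and the default capacity of 64 CLBs (the limit this paper uses for the XC3020) are assumptions of the example; the device count is only a lower bound, since it ignores I/O limits and partitioning overhead.

```python
import math

LITERALS_PER_CLB = 5  # empirical ratio reported in the text (roughly 5:1)


def estimate_clbs(literal_count, literals_per_clb=LITERALS_PER_CLB):
    """Estimate routed CLBs from a literal count (rounded up, an assumption)."""
    return math.ceil(literal_count / literals_per_clb)


def fits_device(literal_count, device_clbs=64):
    """True if the estimated CLB count does not exceed the device capacity."""
    return estimate_clbs(literal_count) <= device_clbs


def min_devices(literal_count, device_clbs=64):
    """Lower bound on the number of devices needed, by CLB count alone."""
    return max(1, math.ceil(estimate_clbs(literal_count) / device_clbs))


print(estimate_clbs(100))   # 20
print(fits_device(321))     # False: 65 estimated CLBs exceed 64
print(min_devices(700))     # 3
```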
Characterization of technology mapping
In this section, we provide some intuition as to why the ratio of literals to CLBs is roughly 5:1. This requires some understanding of the interaction between misII and the mapper xnfmap. Notice that we are actually measuring the performance of misII in relation to a single technology mapper, xnfmap. Other mappers for FPGAs exist [9, 11, 12, 13, 14], but they are limited to combinational circuits. We believe that the pairing operation in xnfmap is quite universal and would exist in any future mapper. As mentioned earlier, an XC3000 CLB has a maximum of 5 logic inputs. The programmable combinational logic section can implement any 5-variable logic function or two functions of a maximum of 4 variables each. Each CLB also has two outputs, x and y.
The combination of gcx, gkx, and decomp operations in misII tends to break complex logic expressions into smaller subexpressions by factorization and sharing of common subexpressions, and a technology mapper attempts to maximize the utilization of a CLB by pairing small subexpressions. With this in mind, we offer a simple explanation of why the ratio of literals to CLBs is roughly 5:1. Figs. 5.a to 5.d enumerate all the configurations in which literals can share a CLB. The numbers in the figures illustrate the lower bounds on the ratios of literals to CLBs for each configuration.
The number of literals can be much larger than the number of inputs, but misII does not appear to generate this type of expression for the benchmark circuits. Also, the exact ratio of literals to CLBs depends on the relative occurrences of these configurations.
In particular, if all configurations are equally likely and all the literals appear as input variables to the CLBs (i.e., no intermediate variables are generated by the mapper), then we have Table 1.a. In practice, not all configurations are equally likely. We studied 60 designs and determined their literal to CLB ratios. This statistic is summarized in Table 1.b.
State assignment for FPGAs
We examine the problem of assigning values to the states of a finite state machine (FSM) so as to minimize the number of CLBs and the delay. Research in multi-level logic minimization employs the literal count in the combinational part of the FSM as the indicator of the quality of a state assignment algorithm [15, 16]. For that matter, it is not widely reported that one-hot encoding provides small literal counts. Perhaps it was dismissed because the number of flip-flops employed in the one-hot encoding scheme equals the number of states. Hence, research in state assignment targeting multi-level logic minimization has focused on minimum-length (or close to minimum-length) encodings.
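To make the flip-flop cost concrete, the sketch below contrasts one-hot codes with minimum-length sequential binary codes for a small machine. The state names and function names are hypothetical, and the sequential scheme is only meant to resemble assigning states consecutive codes; it is not a reproduction of any mustang algorithm.

```python
def one_hot_codes(states):
    """One flip-flop per state: exactly one bit is set in each code."""
    n = len(states)
    return {s: format(1 << i, f"0{n}b") for i, s in enumerate(states)}


def sequential_codes(states):
    """Minimum-length binary codes assigned in order of appearance."""
    # Smallest width that can distinguish all states (at least 1 bit).
    width = max(1, (len(states) - 1).bit_length())
    return {s: format(i, f"0{width}b") for i, s in enumerate(states)}


states = ["s0", "s1", "s2", "s3", "s4"]  # hypothetical 5-state FSM
hot = one_hot_codes(states)
seq = sequential_codes(states)
print(hot["s4"], len(hot["s4"]))  # 10000 5 -> five flip-flops
print(seq["s4"], len(seq["s4"]))  # 100 3   -> three flip-flops
```

The code width is the number of flip-flops needed, so the gap between the two schemes grows with the number of states.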
It is a common belief that the cost in logic complexity of one-hot encoding is usually somewhat higher than for other methods, but generally not far out of line. Moreover, because the transitions in one-hot encoding are all two-step, it leads to circuits slower than could be built employing a single-transition-time assignment [17, p. 177]. However, in the FPGA technology, flip-flops are essentially free in the XC3000 series, as each CLB has one or two programmable flip-flops. The naive one-hot encoding may after all be the winner over the elaborate minimum-length encoding schemes that have been developed [18]. We pose the following question: what is the best strategy, measured in terms of the number of CLBs and speed, among the options provided by the state assignment program mustang [15]?
State encoding for minimizing CLBs
The finite state machines are from the MCNC-89 benchmarks [8]. The experiment is conducted using mustang for state assignment, and misII for logic expression minimization, applying the standard script once. The logic expressions are translated to XNF format and technology mapped by xnfmap to produce LCA files. The LCA files are then placed and routed by apr to produce the final design, all using XC3020PC-84 packages. Tables 2 and 3 show the number of CLBs and literals for most of the encoding schemes available in mustang. We emphasize that the number of CLBs reported is the number of routed CLBs used to implement a complete FSM on an XC3020PC-84 package, not just the combinational part of it. Designs with more than 64 CLBs are not routed. Clearly, the number of CLBs using one-hot encoding is substantially smaller than with any other encoding scheme available in mustang for all 31 finite state machines in Tables 2 and 3.
To demonstrate further that the superiority of one-hot encoding is not simply an anomaly of a benchmark set with a small number of states, we introduce three larger designs, planet, scf, and styr, into the experiments. We report the literal and CLB counts separately in Table 4; the designs are routed using different packages. The same trend, that one-hot is superior, is again observed for these larger designs. More importantly, FSMs encoded with strategies other than one-hot often cannot be completely routed.
In general, the number of literals can be further reduced by using a much longer optimization script in misII. However, the literal counts for one-hot encoding using the short standard script are comparable to other encoding methods using the long optimization script.
State encoding and delay
It is informative to know the speed of the finite state machines under different methods of encoding. Table 3 shows the speed, as reported by the design editor xact, of FSMs encoded with different strategies. Again, it shows that one-hot-encoded FSMs overwhelmingly outperform FSMs encoded in other schemes in speed. This is because one-hot encoding produces next-state logic functions that have fewer inputs than the next-state logic functions from minimum-length encodings. The one-hot encoding scheme suits the FPGA architecture, which has limited fanin but ample flip-flops in a CLB. We present the logic equations of the one-hot encoded FSM bbara and the one generated by the minimum-length fanout-oriented option (-tp) in Tables 5.a
The structure of a logic circuit is dictated initially by the design. In the course of implementation, this structure may be altered by tools such as the logic minimizer and the technology mapper. For example, if the number of variables in a logic expression exceeds 5 (infeasible), then the logic minimizer or technology mapper has to decompose the logic expression into feasible sub-expressions. Intermediate variables and nodes are created during the decomposition, which in effect increases the delay.
Our premise is that the structure of a design is not altered much by the mapper. This is particularly true for a design consisting primarily of small logic expressions, for example, the combinational logic of a one-hot encoded FSM (see Table 4.a). In such designs, the CLB and I/O Block delays are straightforward to determine, but the interconnect delays are sensitive to the structure of a design; we estimate the interconnection delay from the logic expressions based mainly on the fanouts of the logic variables. We shall present evidence to support this conclusion.
A delay model
In our delay model, we relate the interconnect delay to the number of fanouts of a signal. This arose from the observation that the XC3000 FPGA architecture has limited routing resources. As the fanout of a signal grows, it uses up more and more routing resources, and hence the delay increases. Fig. 7 shows the "nominal delay" of a signal versus fanout, which is determined in XACT by packing sink CLBs as closely as possible around a source CLB. For example, in Fig. 8, block EE is the source CLB and the rest are sinks; block EE has a fanout of 24. The "nominal delay" is therefore not necessarily the best-case delay. The worst-case delay can be quite large and therefore is not as meaningful as the nominal delay. This delay model tends to underestimate the delays of large circuits because routing congestion is not taken into account. It may also overestimate the delays of small circuits because of pairing (fanin to the same CLB).
There are two knees in the curve in Fig. 7. The initial (roughly linear) portion of the curve indicates that the signal is transmitted through the general-purpose interconnect and switch boxes. As the number of fanouts exceeds 8, the router starts to consume the long lines for routing. The second knee indicates that another level of long lines and general-purpose interconnect is used.
We apply the delay model in a timing verifier to estimate the "worst-case" propagation delay of one-hot encoded FSMs from their logic expressions. We plot both the measured and the estimated delays in Fig. 9. The mean error is -4.12 ns, and the overall relative error (O.R.E.) is 0.18, indicating a fairly accurate estimation.
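The fanout-based estimate can be sketched as a longest-path computation over the logic expressions, with an interconnect delay looked up from a delay-versus-fanout curve. This is only a minimal illustration under stated assumptions: the breakpoints in `NOMINAL_DELAY_POINTS` and the value of `CLB_DELAY_NS` are placeholders invented for the example and are not the data of Fig. 7; the piecewise-linear interpolation, the function names, and the representation of each expression by its list of input names are likewise assumptions. Pairing and routing congestion are not modeled.

```python
# Hypothetical (fanout, delay in ns) breakpoints; NOT the values of Fig. 7.
NOMINAL_DELAY_POINTS = [(1, 1.0), (8, 6.0), (16, 9.0), (24, 14.0)]
CLB_DELAY_NS = 9.0  # placeholder combinational delay of one CLB (assumption)


def interconnect_delay(fanout, points=NOMINAL_DELAY_POINTS):
    """Piecewise-linear interpolation of nominal delay versus fanout."""
    if fanout <= points[0][0]:
        return points[0][1]
    for (f0, d0), (f1, d1) in zip(points, points[1:]):
        if fanout <= f1:
            return d0 + (d1 - d0) * (fanout - f0) / (f1 - f0)
    # Beyond the last breakpoint, extend the slope of the last segment.
    (f0, d0), (f1, d1) = points[-2], points[-1]
    return d1 + (d1 - d0) * (fanout - f1) / (f1 - f0)


def worst_case_delay(expressions, sources):
    """Longest path through an acyclic network of logic expressions.

    expressions: dict mapping each output name to the list of its inputs.
    sources: names available at time 0 (primary inputs and, for an FSM,
             the current-state flip-flop outputs).
    """
    # Fanout of a signal = number of expressions that read it.
    fanout = {}
    for inputs in expressions.values():
        for name in set(inputs):
            fanout[name] = fanout.get(name, 0) + 1

    arrival = {name: 0.0 for name in sources}

    def arrive(name):
        if name not in arrival:
            arrival[name] = CLB_DELAY_NS + max(
                arrive(src) + interconnect_delay(fanout[src])
                for src in expressions[name])
        return arrival[name]

    return max(arrive(name) for name in expressions)


# Hypothetical three-node network.
net = {"n1": ["a", "b"], "n2": ["n1", "a"], "out": ["n2", "n1"]}
print(round(worst_case_delay(net, ["a", "b"]), 2))
```

Each expression is treated as one CLB level, which matches the premise that the mapper does not alter the structure of small expressions much.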
Conclusion
We have observed certain empirical facts about the performance of multi-level logic minimization tools in relationship to a lookup table-based FPGA technology. These observations are made based on specific tools that are commonly used in lookup table-based FPGA designs. There is no intention to claim that these observations are universal. First, we suggested that as a rule of thumb, dividing the literal counts of a set of logic expressions by 5 gives an estimate of the number of routed XC3000 CLBs to implement the logic expressions. This result can be applied to guide the partitioning of the logic expressions portion of a large design into multiple FPGAs. We can estimate the number of routed CLBs to implement the logic expressions simply by counting the literals.
We extended the idea to estimate the delays of the implementation of logic expressions prior to technology mapping, place and route. An empirical delay model is suggested which can be used for delay prediction based on logic expressions. Our results suggest that logic expressions are a good intermediate form to bridge the estimation of performance of a design across different technologies.
Second, we suggest that the one-hot state encoding strategy is a good candidate for finite state machines targeted for lookup table-based FPGAs.
Table 2: Number of literals and CLBs for different state encoding strategies, obtained using the misII standard script and Xilinx xnfmap and apr on an XC3020PC84-100. Mustang options are: -a, graph embedding performed by a simulated annealing-based algorithm; -e, expand state codes to use up unused state codes; -n, a state assignment option which uses a fanin-oriented algorithm to produce an encoding of states; -p, a state assignment option which uses a fanout-oriented algorithm to produce an encoding of states; -t, a variation of the fanin- and fanout-oriented heuristics which sometimes produces better results; -r, random encoding with the default seed; -ran, random encoding with a random seed; -s, states of the machine are assigned sequential codes.
Figure 9: Delays of one-hot encoded FSMs: measured (in XACT 2.12) vs. estimated.