This paper describes a novel Field Programmable Gate Array (FPGA) logic synthesis technique which determines if a logic function can be implemented in a given programmable circuit and describes how this problem can be formalized and solved using Quantified Boolean Satisfiability. This technique is general enough to be applied to any type of logic function and programmable circuit; thus, it has many applications to FPGAs. The application demonstrated in this paper is FPGA PLB evaluation where the results show that this tool allows radical new features of FPGA logic blocks to be evaluated in a rigorous scientific way.
Introduction
FPGAs are integrated circuits characterized by a regular array of clustered programmable logic blocks (PLBs) connected together by programmable interconnects, as shown in Fig. 1. The tile shown in the figure can be thought of as the basic building block of an FPGA and includes one cluster of PLBs along with the associated routing for that cluster. Clustering PLBs into regular groups has been shown to improve the performance of FPGAs in terms of both area and speed [1].
An example of a PLB is shown in Fig. 2. The logic block is built around a k-input lookup table (k-LUT), a 4-input LUT in this example, which contains $2^k$ SRAM bits. Although the k-input LUT is very flexible, it is usually beneficial to add dedicated non-programmable logic to the PLB such as adders and XOR/AND-gates [2, 3]. These features increase the number of functions that can be implemented by a PLB without the power, speed, and area costs associated with programmable logic. However, because this reduces the flexibility of the PLB, optimal mapping of functions to these non-programmable components is difficult.
CAD for FPGAs
Clustering is a netlist partitioning step that identifies highly connected groups of PLBs. The placement step then assigns these groups to specific locations on the device. After placement, the routing step assigns wire segments and routing switches to implement all PLB-PLB connections in the netlist. Finally, the software creates a configuration bitstream used for programming the target device with the required configuration bits.
Motivation
The cost of implementing a circuit in an FPGA is directly proportional to the number of PLBs required to implement the functionality of the circuit. FPGAs are sold in a number of pre-fabricated sizes, so decreasing the number of PLBs may allow a circuit to be realized in a smaller FPGA. Typical pricing is roughly linear in the number of PLBs in the FPGA device [4].
The PLB architecture has a significant impact on the number of PLBs required to realize a particular circuit. Thus, clever PLB designs are necessary that capture the majority of the functions encountered in typical circuits. In this paper we will show how methods based on Quantified Boolean Satisfiability (QSAT) can be used to rapidly determine if a PLB architecture will achieve a high capture rate. We also show how to convert this problem into a simpler form and solve it using Boolean Satisfiability (SAT).
Background
Technology Mapping
The technology mapping step in the FPGA CAD flow converts a gate-level network consisting of primitive gates into the PLBs that are present in the target FPGA architecture. The goal of the technology mapping step is to reduce area, delay, or a combination thereof in the network of PLBs that is produced. In this work, delay is proportional to the depth of a circuit where the depth of a node is defined as the longest path from the node to a primary input. A primary input is any node in a circuit with no fanin such as an input pin. The dual to this is a primary output which is any node in a circuit with no fanouts such as an output pin. Previous work showed that the depth-optimal mapping solution can be obtained in polynomial time using a dynamic programming procedure [5] .
The process of technology mapping is often treated as a covering problem. For example, consider the process of mapping a circuit into LUTs as illustrated in Fig. 4 . Fig. 4a illustrates the initial gate level network, Fig. 4b illustrates a possible covering of the initial network using 4-LUTs, and Fig. 4c illustrates the LUT network produced by the covering. In the mapping given, the gate labeled x is covered by both LUTs and is said to be duplicated. In a duplication-free mapping, each gate in the initial circuit is covered by a single LUT in the mapped circuit [6] . However, surprisingly, the controlled use of duplication can lead to further area savings [7] . In contrast to the depth minimization problem, the area minimization problem was shown to be NP-hard for LUTs of size four and greater [8] . Thus, heuristics are necessary to solve the area minimization problem.
Another way to look at technology mapping is as a cone selection problem. The subcircuits circled in Fig. 4b are examples of cones. Technology mapping seeks to find the best set of cones that can be mapped to the current PLB architecture. "Best" is determined by the optimization goal, such as area, speed, or power. If the FPGA architecture consists solely of K-LUTs, mapping from cones to K-LUTs is a direct process since any cone with K or fewer inputs can be implemented in a K-LUT. A cone with K or fewer inputs is said to be K-feasible. Thus, to technology map circuits to K-LUTs, the circuit simply has to be decomposed into a set of K-feasible cones. However, if the FPGA architecture consists of generic K-input PLBs, mapping from cones to PLBs is much more difficult since PLBs cannot implement all possible K-feasible cones. For example, the PLB in Fig. 5 cannot implement a 3-input OR gate. Thus, to technology map to generic PLBs, one must discard all K-feasible cones which cannot map into the given PLB architecture. We will show how we successfully use QSAT to accomplish this in later sections.
Although more limited in functionality, PLBs offer speed, area, and power advantages over fully programmable K-LUTs. Furthermore, in general only a small subset of K-feasible cones will appear in most logic circuits. Thus, so long as a given PLB architecture captures most cones encountered in real circuits, it will be successful in implementing circuits. One way to determine a PLB's capture success is to extract a set of K-feasible cones from benchmark circuits and determine how many of these cones can fit into a given PLB where a high fit percentage is desired. Furthermore, resource usage is necessary to determine if the non-programmable components of a K-input PLB will add minimal overhead to the overall FPGA. We present two tools in this paper that do both these tasks.
Quantified Boolean Satisfiability
As stated in Sec. 1, the main contribution of this work is to examine the use of QSAT in PLB evaluation. QSAT is the problem of determining whether a quantified Boolean formula (QBF) is satisfiable. In Equ. 1, the first expression shows a satisfiable Boolean formula with its associated satisfying assignment. In contrast, simply by adding quantifiers to it, the QBF shown in the second expression is unsatisfiable due to the universally quantified variable $x_2$.
For all practical purposes, QSAT only deals with QBFs in Conjunctive Normal Form (CNF, sometimes referred to as Product-of-Sums). A Boolean function is in CNF if it consists solely of a conjunction of clauses, where a clause is a disjunction of literals and a literal is any variable or its complement. The expressions in Equ. 1 are examples of formulae in CNF. In CNF, the problem of QSAT can be rephrased as follows: given a QBF
$$Q_1 x_1 \ldots Q_n x_n \; f(x_1, \ldots, x_n),$$
where $Q_i \in \{\exists, \forall\}$, find an assignment to its variables, $x_1 \ldots x_n$, such that each clause in $f(x_1, \ldots, x_n)$ has at least one literal that evaluates to true.
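To make the effect of quantifiers concrete, the following is a minimal brute-force sketch of QBF evaluation over a CNF matrix. The function names and the example formula are illustrative assumptions (the formula is not the paper's Equ. 1), and real QSAT solvers use far more sophisticated search.

```python
def eval_cnf(clauses, assignment):
    """True if every clause has at least one true literal.
    Literals are nonzero ints: +v means x_v, -v means NOT x_v."""
    return all(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)

def eval_qbf(prefix, clauses, assignment=None):
    """Recursively evaluate a prenex QBF.
    prefix is a list of (quantifier, variable) pairs, quantifier 'E' or 'A'."""
    assignment = dict(assignment or {})
    if not prefix:
        return eval_cnf(clauses, assignment)
    (q, v), rest = prefix[0], prefix[1:]
    branches = (eval_qbf(rest, clauses, {**assignment, v: b}) for b in (False, True))
    # An existential variable needs one good branch, a universal one needs both.
    return any(branches) if q == 'E' else all(branches)

# Illustrative matrix: (x1 OR x2) AND (NOT x1 OR NOT x2)
example_cnf = [[1, 2], [-1, -2]]
plain_sat = eval_qbf([('E', 1), ('E', 2)], example_cnf)   # ordinary SAT
quantified = eval_qbf([('E', 1), ('A', 2)], example_cnf)  # x2 universal
```

With both variables existential the formula is satisfiable (for instance $x_1 = 1$, $x_2 = 0$), but once $x_2$ is universally quantified no single choice of $x_1$ works for both values of $x_2$.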
Quantified SAT Applied to PLB Evaluation
The goal of PLB evaluation is to determine how useful a new PLB architecture will be in implementing circuits. Usefulness is often measured in terms of area, speed, and power. When evaluating area, the PLB flexibility is a concern. Because a k-input PLB contains non-programmable components, it loses flexibility in terms of what functions it can implement. Thus, the non-programmable components in the PLB will not always be used, which incurs an area overhead. However, so long as the PLB is flexible enough to implement most functions, the area overhead will be minimized. A flexible k-input PLB can be characterized by how many k-input cones can fit into it, where a high fit percentage is desired. The underlying question asked when determining this fit percentage is as follows: given an n-variable Boolean function, $F_{function}(x_1, x_2, \ldots, x_n)$, does there exist a programmable configuration to a circuit, $G$, such that the output of the circuit will equal $F_{function}(x_1, x_2, \ldots, x_n)$ for all inputs?
Previously, robust heuristics to answer this question fell into two categories: (1) a specialized PLB is proposed and a customized mapping algorithm is implemented to map benchmark circuits using the proposed element [9]; (2) specialized Boolean-matching techniques are developed to decompose a logic function so that it matches the structure of the proposed PLB [10]. Both of these techniques require a specific logic manipulation technique for each PLB and therefore lack generality. Our technique takes a much more general approach based on QSAT.
Formalizing Function Fitting Problem
Assuming that a programmable circuit can be represented as a Boolean function $G_{circuit}$ over its input signals, configuration bits, internal signals, and output, the problem of function mapping into programmable logic can be represented formally as a QBF (Equ. 2) in which the configuration bits are existentially quantified and the inputs are universally quantified.
A satisfying assignment to Equ. 2 implies that $F_{function}$ can be realized in the programmable circuit.
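As a sketch of the quantifier structure implied by the question posed in the previous section (does a configuration exist such that the circuit equals the function for all inputs), one possible shape of such a QBF is given below. The symbol names $C$ (configuration bits), $X$ (inputs), and $Z$ (internal signals) are assumptions introduced here for illustration and may differ from the paper's own notation.

```latex
\exists C \;\, \forall X \;\, \exists Z \;\,
\bigl( G_{circuit}(C, X, Z) \equiv F_{function}(X) \bigr)
```

Read left to right: a single configuration $C$ must be chosen first, and it must then work for every input vector $X$, while the internal signals $Z$ are free to take whatever consistent values each input induces.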
In order to derive Equ. 2, the proposition ($G_{circuit} \equiv F_{function}$) must be represented as a CNF Boolean formula. This can be done using a well-known derivation technique that converts logic circuits into a characteristic function in CNF [11]. This characteristic function describes all valid input, output, configuration-bit, and internal-signal vectors for the configurable circuit.
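The idea of a CNF characteristic function can be illustrated with a small sketch. The clause sets below are the standard gate encodings (one variable per signal, clauses true exactly when the gate relation holds); the helper names, the variable numbering, and the example circuit are assumptions made for illustration, not the paper's example.

```python
from itertools import product

def and_gate_cnf(a, b, z):
    """Clauses that hold exactly when z == (a AND b)."""
    return [[-z, a], [-z, b], [z, -a, -b]]

def or_gate_cnf(a, b, z):
    """Clauses that hold exactly when z == (a OR b)."""
    return [[z, -a], [z, -b], [-z, a, b]]

def satisfied(clauses, values):
    """values maps variable number -> bool."""
    return all(any(values[abs(l)] == (l > 0) for l in c) for c in clauses)

# Characteristic function of out = (x1 OR x2) AND x3, using variable 4
# for the internal OR output and variable 5 for the circuit output.
circuit_cnf = or_gate_cnf(1, 2, 4) + and_gate_cnf(4, 3, 5)

# The CNF is true exactly on the consistent signal vectors.
consistent = [
    bits for bits in product([False, True], repeat=5)
    if satisfied(circuit_cnf, dict(zip(range(1, 6), bits)))
]
```

Of the 32 possible signal vectors, only the 8 that agree with the circuit's behavior (one per input combination) satisfy the CNF.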
Removing Quantified Variables
Although QSAT solvers have shown initial promising results, it is often still faster to solve a QBF by removing the universal quantifiers and converting it to a SAT problem [12] . Removing the universal quantifiers eliminates the need to find multiple SAT instances for all universally quantified variable assignments, thus saving time; however, in doing so, the size of the Boolean formula increases substantially. To remove the universal quantifiers in a QBF, F , its proposition, f , is replicated to explicitly enumerate all possible assignments of the universally quantified variables. These replicated formulae are then conjoined with the logical AND operator to form a Boolean function that can be solved with SAT. In our work, we chose to remove the universal quantifiers since it proved faster to use SAT for most of the PLBs we evaluated.
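The replication step can be sketched as follows. This is a minimal illustration under stated assumptions: the PLB modeled here (a 2-input LUT whose output is ANDed with a third pin) is hypothetical, the per-copy clauses are already simplified by substituting the fixed input values, and `brute_force_sat` is only a stand-in for a real SAT solver such as Chaff.

```python
from itertools import product

def expand_universal(copy_cnf, num_inputs):
    """Conjoin one copy of the proposition per assignment to the
    universally quantified inputs. copy_cnf(bits) returns the clauses
    left over once the inputs are fixed to bits."""
    clauses = []
    for bits in product([False, True], repeat=num_inputs):
        clauses.extend(copy_cnf(bits))
    return clauses

def brute_force_sat(clauses, num_vars):
    """Tiny stand-in for a SAT solver: returns a model or None."""
    for bits in product([False, True], repeat=num_vars):
        model = dict(zip(range(1, num_vars + 1), bits))
        if all(any(model[abs(l)] == (l > 0) for l in c) for c in clauses):
            return model
    return None

def lut_and_plb(target):
    """Per-copy clauses for a hypothetical PLB: out = LUT2(x1, x2) AND x3.
    Variables 1..4 are the LUT configuration bits; target is F."""
    def copy_cnf(bits):
        x1, x2, x3 = bits
        want = target(x1, x2, x3)
        if not x3:
            # Output is forced to 0; an empty clause marks a contradiction.
            return [[]] if want else []
        cfg = 1 + 2 * x1 + x2        # configuration bit selected by (x1, x2)
        return [[cfg] if want else [-cfg]]
    return copy_cnf

fits = brute_force_sat(
    expand_universal(lut_and_plb(lambda a, b, c: (a or b) and c), 3), 4)
no_fit = brute_force_sat(
    expand_universal(lut_and_plb(lambda a, b, c: a or b or c), 3), 4)
```

In this sketch the first target yields a configuration for the LUT bits, while the 3-input OR produces a contradictory copy and therefore no model. The number of copies is $2^n$ for $n$ inputs, which is the source of the formula growth noted above.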
In order to give better understanding to the previously described ideas, an example is given.
Assume that a 3-input function F needs to be implemented in the PLB shown previously in Fig. 5 .
In the following steps, $F$ represents the function of the cone under consideration for mapping, $X_i$ represents the input vector $x_1 x_2 x_3 = i$, and $F_i = F(X_i)$.
Step 1: Create CNF for individual elements in programmable circuit.
Step 2: Formulate the programmable circuit CNF from equations 4, 5, and 6.
Step 3: Replication of equation 7 to remove quantified variables. This formulates $G_{Total}$, where a satisfying assignment to $G_{Total}$ implies $F$ can be realized in the programmable circuit.
In the previous example, the pins on the programmable circuit in Fig. 5 are not permutable. Given the labeling convention in Fig. 5, the function $F = (x_1 + x_2) \cdot x_3$ can be implemented; however, the function $F = (x_1 + x_3) \cdot x_2$ cannot. There is no need to restrict the labeling of the input pins in this manner because most programmable circuits are able to route signals to any input pin. In order to model this flexibility, virtual multiplexers controlled by virtual configuration bits, $V_p$, are added at each input pin of the programmable circuit. Going back to the circuit shown in the last example, Fig. 7 illustrates the previous circuit with virtual multiplexers added at the input pins.
Thus, if $F = (x_1 + x_3) \cdot x_2$ is to be mapped into this network, the virtual multiplexers would force $x_1$ and $x_3$ onto the first two pins of the circuit and $x_2$ onto the third pin feeding the AND gate to generate a satisfiable solution. In order to add the virtual multiplexers to the previous example, the virtual multiplexer characteristic functions need to be added in Step 1; the process then proceeds as previously shown.
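The effect of virtual multiplexers can be illustrated by a small exhaustive search. This is a sketch under assumptions: the PLB modeled here (a 2-input LUT on the first two pins, ANDed with the third pin) is a hypothetical stand-in for the circuit of Fig. 5, and the search enumerates configurations directly rather than going through CNF and a SAT solver.

```python
from itertools import product

def plb_output(lut_bits, pins):
    """Hypothetical PLB: a 2-input LUT on pins 0 and 1, ANDed with pin 2."""
    p0, p1, p2 = pins
    return lut_bits[2 * p0 + p1] and p2

def find_fit(target, permutable):
    """Search for LUT bits (and, if permutable, virtual mux selections)
    such that the PLB equals target for every input assignment."""
    num_pins = 3
    if permutable:
        # Each virtual mux independently selects one of the three inputs.
        selections = list(product(range(num_pins), repeat=num_pins))
    else:
        selections = [(0, 1, 2)]     # pin i is hard-wired to input i
    for select in selections:
        for lut_bits in product([False, True], repeat=4):
            if all(
                plb_output(lut_bits, [x[s] for s in select]) == target(*x)
                for x in product([False, True], repeat=num_pins)
            ):
                return select, lut_bits
    return None

def f_a(x1, x2, x3):
    return (x1 or x2) and x3

def f_b(x1, x2, x3):
    return (x1 or x3) and x2

fixed_a = find_fit(f_a, permutable=False)
fixed_b = find_fit(f_b, permutable=False)
muxed_b = find_fit(f_b, permutable=True)
```

With fixed pins only the first function fits; once the mux selections are part of the search, the second function fits as well, with $x_2$ routed to the pin feeding the AND gate.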
Application to PLB Evaluation
In the previous section we showed how to derive a general function mapping technique using QSAT.
We will show how we use this to evaluate PLB architectures through two tools as follows.
PLB Fit Percentage
The following shows a high-level overview of our PLB fit percentage algorithm. As stated previously, PLBs that can capture the functionality of most cones found in real circuits are desired since their non-programmable components will not be wasted. In order to help find such PLBs, our tool can be used to return a PLB cone fit percentage, where a high fit percentage is preferred. This fit percentage is found by extracting a set of cones from a list of circuits, then applying our QSAT decision step to remove cones that do not fit in the given architecture, as shown in lines 1 and 2 of the following algorithm. By recording the number of cones generated and discarded, a fit percentage for various PLB architectures can be found.
A version of the algorithm described in [7] is used to generate and store all K-feasible cones in the graph. The K-feasible cones are generated as the graph is traversed in topological order from primary inputs to primary outputs. At every internal node v, new cones are generated by combining the cones at the input nodes.
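The fit percentage computation itself is simple once a fit decision is available. The sketch below assumes cones are given as Python functions and uses a hand-written predicate for a hypothetical PLB (a 2-input LUT ANDed with a third, fixed pin) in place of the QSAT decision step; all names are illustrative.

```python
from itertools import product

def fit_percentage(cones, fits):
    """Percentage of cones accepted by the fit predicate.
    cones: iterable of Boolean functions; fits: cone -> bool."""
    cones = list(cones)
    kept = [cone for cone in cones if fits(cone)]   # discard no-fit cones
    return 100.0 * len(kept) / len(cones) if cones else 0.0

def fits_lut_and(cone):
    """Stand-in decision step for a hypothetical PLB: LUT2(x1, x2) AND x3
    with fixed pins. The cone fits iff it is 0 whenever x3 is 0."""
    return not any(cone(a, b, False)
                   for a, b in product([False, True], repeat=2))

sample_cones = [
    lambda a, b, c: (a or b) and c,
    lambda a, b, c: (a != b) and c,
    lambda a, b, c: a or b or c,
    lambda a, b, c: a and b and c,
]
percent = fit_percentage(sample_cones, fits_lut_and)
```

Here three of the four sample cones fit, giving 75%. In the tool described above, the predicate would be replaced by the QSAT or SAT based check.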
This tool cannot be used to fully evaluate area usage. However, fit percentage provides insights into the components that may be beneficial in a PLB. To obtain a more comprehensive evaluation of area, one should evaluate the PLB usage needed to implement a set of benchmark circuits. Using this in conjunction with an area model for the PLB, a full picture of area usage can be found.
Unfortunately, obtaining the PLB usage for circuits requires custom decomposition techniques for technology mapping to the PLB architecture. We overcome this obstacle by incorporating our function mapping technique into a K-LUT technology mapper to form a general K-input PLB mapper.
Technology Mapping Using QSAT
Our function mapping technique allows us to convert any K-LUT technology mapper into a K-input PLB technology mapper. As stated in Section 2.1, technology mapping to LUTs can be thought of as a covering problem. The same is true for K-input PLBs; however, because a K-input PLB is not fully programmable, not all K-input cones can fit into the PLB. Thus, when generating cones during the technology mapping phase, cones that do not fit into the given PLB should be discarded. This will leave a set of cones which are guaranteed to fit into the PLB architecture.
1 GenerateCones()
2 RemoveNoFitCones()
3 for each mapping iteration
4   TraverseFwd()
5   TraverseBwd()
6 end for
7 ConesToPLBs()

We base our work on IMap [13], an iterative K-LUT technology mapping algorithm. For a detailed description of IMap please refer to [13], which shows that IMap produces amongst the best area results of any known technology mapper. The basic framework for our technology mapper is presented in the algorithm above. First, a call to GenerateCones generates a subset containing most of the K-feasible cones for each node in the graph, where K is the input size of the PLB. Next, a call to RemoveNoFitCones discards all cones that cannot fit into the PLB architecture. This decision process uses QSAT as described in Sec. 3.1. Once a set of valid cones is found, a series of forward and backward graph traversals is started to select the best cover of the graph. The cost of the cover is measured in terms of area and depth. The forward traversal, TraverseFwd, selects a cone for each node, and the backward traversal, TraverseBwd, selects a set of cones to cover the graph. Iteration is beneficial because every backward traversal influences the behavior of the forward traversal that follows it.
During the forward traversal, the algorithm updates the depth and the area flow for every node and edge encountered. Area flow is a heuristic for estimating the area of the mapping solution below a node or an edge where minimizing it leads to smaller mapping solutions [13] . At each internal node v, a cone rooted at v is selected to cover v and some of its predecessors in a mapping solution.
The quality of the mapping solution is determined by the selection procedure and thus the set of cones selected.
During depth-oriented mapping, on the first mapping iteration, the cone with the lowest depth is selected. Depth is often correlated with delay of the circuit, thus minimizing the depth of the circuit often leads to a faster circuit after technology mapping. The first forward traversal establishes the optimal mapping depth, ODepth, which can then be used in subsequent iterations to bound the depth of cones selected at every node. Using the optimal depth and the height of a node v, a bound can be defined on the depth of a cone C v as follows
The height of a node or cone is defined as the longest path from that node or cone to a primary output of the circuit. Cones that meet the bound requirement are preferred and among a set of cones that meet the bound requirement, cones with lower area flows are selected. This selection strategy ensures that the mapping solutions will still achieve the optimal depth selected while minimizing area.
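The selection rule described above can be sketched as follows. This is an illustration under assumptions: the bound is taken to have the form depth($C_v$) + height($v$) ≤ ODepth, which is one reading consistent with the definitions of depth and height given here, and the fallback used when no cone meets the bound is a guess, not something stated in the text. The `Cone` record and its fields are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cone:
    name: str
    depth: int        # depth of the cone's root in the mapped network
    area_flow: float  # heuristic area estimate for the cone

def select_cone(cones, optimal_depth, height):
    """Pick a cone for a node, assuming the bound
    depth(C_v) + height(v) <= optimal_depth."""
    within_bound = [c for c in cones if c.depth + height <= optimal_depth]
    if within_bound:
        # Among cones that preserve the optimal depth, minimize area flow.
        return min(within_bound, key=lambda c: c.area_flow)
    # Assumed fallback: if nothing meets the bound, take the shallowest cone.
    return min(cones, key=lambda c: c.depth)

candidates = [Cone("c1", 2, 5.0), Cone("c2", 3, 2.5), Cone("c3", 4, 1.0)]
chosen = select_cone(candidates, optimal_depth=5, height=2)
```

In this example the cone with the smallest area flow overall (`c3`) is rejected because it would exceed the depth bound, so the cheapest cone within the bound is chosen instead.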
During the backward traversal, internal nodes of the graph are visited in reverse topological order and a cover of cones is produced. During this traversal, the height of each internal node v is updated to the height of the cone covering it, for use in Equation 9 in the next forward traversal. If v is found in several cones, the largest height is used.
Finally, a call to ConesToPLBs converts the cones selected by the final backward traversal into PLBs.
Generating k-Feasible Cones
A version of the algorithm described in [7] is used to generate and store all K-feasible cones in the graph. The K-feasible cones are generated as the graph is traversed in topological order from primary inputs to primary outputs. At every internal node v, new cones are generated by combining the cones at the input nodes. In contrast to the original IMap algorithm, which combined the cones in every possible way, our cone generation algorithm combines cones only if they have no more than (k + e) inputs in total. As long as e was set to a sufficiently high number (2 in the experiments), this heuristic sped up the cone generation process without significantly impacting the quality of the mapping solution.
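A minimal sketch of this style of cone enumeration is shown below. It is not the algorithm of [7] or IMap itself: each cone is represented only by its set of inputs, the graph encoding is a plain dictionary, and the way the (k + e) limit is applied (as a pre-filter on the summed input counts before merging) is an assumed reading of the heuristic.

```python
from itertools import product

def enumerate_cones(fanins, order, k, e=2):
    """Enumerate K-feasible cones, each represented by its set of inputs.
    fanins: node -> list of fanin nodes (empty for primary inputs).
    order: nodes in topological order from primary inputs to outputs."""
    cones = {}
    for v in order:
        # Trivial entry: the node itself, used as a leaf by cones above it.
        cones[v] = {frozenset([v])}
        if not fanins[v]:
            continue
        for combo in product(*(cones[u] for u in fanins[v])):
            # Pre-filter on the total input count before merging (assumed
            # reading of the k + e heuristic), then keep K-feasible unions.
            if sum(len(c) for c in combo) > k + e:
                continue
            merged = frozenset().union(*combo)
            if len(merged) <= k:
                cones[v].add(merged)
    return cones

# Small example network: g1 = op(a, b), g2 = op(b, c), g3 = op(g1, g2).
fanins = {"a": [], "b": [], "c": [],
          "g1": ["a", "b"], "g2": ["b", "c"], "g3": ["g1", "g2"]}
order = ["a", "b", "c", "g1", "g2", "g3"]
cones = enumerate_cones(fanins, order, k=3)
```

For `g3` with k = 3 this yields, among others, the cone with inputs {a, b, c}, which covers all three gates. Shared inputs such as `b` are counted twice by the pre-filter but only once in the merged set, which is why e must be large enough not to reject useful combinations.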
Results
Validity of Technique
To validate our evaluation process, we first evaluated a wide range of PLBs to show the generality of our technique. Secondly, we used our technology mapping algorithm to evaluate a very successful commercial FPGA architecture to see if our results are consistent with what is reported in industrial literature [14] .
Evaluation of Various PLBs
To show the power of the PLB evaluation algorithm, several unrelated PLB architectures were evaluated. Fig. 8 shows the five different PLB architectures used for evaluation. To evaluate the versatility of each PLB, a set of cones was extracted from a list of circuits taken from the MCNC benchmark suite [15] (approximately 1000 K-input cones per circuit, where K was the input size of the PLB). These cones were tested for PLB fitting using the Chaff [16] SAT solver.
The circuits used were unrelated to generate a large set of dissimilar cones. Table 1 shows the PLB fit percentage of cones per circuit. The last row shows the total percentage of all cones that fit.
Note that the cone fit percentage varies widely for all PLBs depending on the circuit. This shows that PLB usefulness depends on the application of the circuit. Interestingly, PLB (b) failed for all circuits except the ALU circuit (C2670). One reason is that PLB (b) uses an XOR gate, and XOR gates are very rare in most control circuits and are generally used for arithmetic logic.
Also, PLB (e) was only able to fit 9-input cones for a few circuits. This was expected since it is a simplified version of a commercial PLB primarily used to implement 5-input functions or a 4:1 MUX, and is rarely used as a general 9-input function generator [17] . Thus, in addition to generating 9-input functions for the PLB (e), 6, 7, and 8-input functions were evaluated to get a broader picture of the PLB's flexibility. This is shown in Table 2 . As the numbers show, this PLB looks much more useful when adding a wider range of functions.
As a second part of the evaluation, we want to give a full area estimate on the area overhead associated with a given PLB. For our experiments, we will focus on PLB(a). The 5-input function fit percentage for that PLB is quite high, but the area overhead incurred by the AND-gate is uncertain.
The AND-gate configuration in PLB (a) is very common in commercial PLBs [14]. The claimed benefit of adding an AND-gate is that it allows one to chain PLBs together to create an extremely fast network of PLBs [14] with insignificant area overhead. Most FPGA companies accomplish this by restricting the routing into the AND-gate to adjacent PLBs as shown in Fig. 9. This prevents conditions such as the AND-gate being fed by IO-pins.
Table 2: The percentage of cones that fit into Fig. 8 (e).
To model the AND based PLB, we construct a PLB similar to what is shown in Fig. 10 (registers, SRAM bits, and other details common to a 4-LUT based architecture have been omitted for simplicity). The architecture we are assuming is a clustered island style FPGA with 60 routing tracks per channel. Each cluster consists of 10 PLBs, 22 inputs feeding each cluster, 1 clock input, and 75% of cluster inputs routable to any given PLB input (de-populated [18]). The depopulation of inter and intra-cluster inputs has been shown to improve area substantially while having minimal effect on timing and routing in the FPGA [18]. Since FPGA area is known to be dominated by transistor area, we use minimum-width transistors as our area metric [19]. Using the characteristics derived previously, we obtain the minimum-width transistor usage for a cluster of PLB(a). We also used VPR [19] to derive the minimum-width transistor usage of an entire tile. Using both of these metrics, we compared against a 4-LUT based architecture as shown in Table 3 (counts have been derived from [19]). The row labeled ratio is the ratio of the minimum-width transistor count of PLB(a) against the 4-LUT.
Table 3: Minimum-width transistor count comparisons.
Finally, to evaluate the effect of PLB(a) on depth and area, we technology mapped a large set of MCNC benchmark circuits using our SAT based technology mapper and compared the PLB resource usage against a 4-LUT based architecture. The results are shown in Table 4. Column 'Area' indicates the PLB(a) or 4-LUT usage needed to implement the associated circuit and column 'Depth' indicates the depth of the technology mapped circuit, where each edge is given a length of 1. The 'Total' row indicates the summation of the depth and area numbers, and the 'Ratio' row is the ratio of the total numbers when compared against the PLB(a) column. The results clearly show that adding an AND-gate to the PLB reduces the depth and area of the circuit. Given the claim that the cascade based routing structure of PLB(a) is much faster than a general routing channel [14], combined with the reduced depth values shown in Table 4, circuits implemented with PLB(a) should perform faster than those in a simple 4-LUT architecture. Furthermore, because the transistor count increases by less than 1% while the PLB count falls by 7.9% on average, we expect no net area overhead associated with adding a cascade AND-gate. Thus, the claim that an AND-gate has insignificant area overhead is consistent with our findings using our general PLB technology mapper.
Running Time
For PLBs of input size 6 or less, using SAT during technology mapping had a negligible effect on running time: the evaluation time for each cone was under 1 millisecond. However, once the PLBs grew beyond 6 inputs, the SAT running time increased significantly, particularly when the cones did not fit into the given PLB. For illustration, we found the average running time of 50 instances of attempting to fit an 8-input function into the 9-input PLB shown in Fig. 8.

The dramatic difference in running time between PLBs with 6 or fewer inputs and PLBs with more than 6 inputs can be attributed to the exponential growth of the SAT CNF expression with respect to the number of inputs, which results from the replication process used to remove universal quantifiers as described in Sec. 3.2. Fortunately, there have been some promising numbers from new QSAT solvers, so the replication process to remove universal quantifiers from our original QBF expression may not be necessary. In particular, the QBF solver known as Quantor [20] was very successful in solving some hard PLB fit instances, as shown in Table 5. The QSAT-Quantor column shows the Quantor QSAT solver running time in seconds when attempting to fit various functions of k inputs.
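The growth caused by replication can be made concrete with a rough count. Assuming the proposition for one copy of the circuit contributes $m$ clauses (a symbol introduced here only for illustration) and there are $n$ universally quantified inputs, the expanded formula has on the order of

```latex
2^{n} \cdot m \ \text{clauses}, \qquad
\text{e.g. } 2^{6} = 64 \text{ copies for } n = 6
\text{ versus } 2^{9} = 512 \text{ copies for } n = 9 .
```

The internal signal variables are replicated along with the clauses, so the number of variables the SAT solver must handle grows at the same rate.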
Specifically, the 7-input row shows results for attempting to fit a 7-input function into a 7-input PLB, and the 8-input row shows results for attempting to fit an 8-input function into an 8-input PLB. The SAT-Chaff column shows the running time in seconds for the same problem when the quantifiers are first removed and the result is solved with the Chaff SAT solver. As Table 5 shows, for small functions of fewer than 6 inputs, the difference between Quantor and Chaff is negligible; however, Quantor clearly shows benefits for the larger fit instances with 7 inputs or more. These gains were specific to the Quantor QBF solver and were not seen with some other popular QBF solvers such as Quaffle [21]. Quantor uses a replication process similar to our approach in order to remove universal quantifiers. However, unlike our approach, Quantor applies advanced optimizations to the replicated CNF before solving the QBF. These optimizations are suspected to cause the performance gains seen with Quantor.

PLB size     QSAT-Quantor (sec)    SAT-Chaff (sec)
< 6-input    0.01                  0.01
7-input      7.12                  334
8-input      1.86                  60.2
Table 5: Comparison of using the Quantor QBF solver vs. removing quantifiers from the QBF and using Chaff.
Conclusion and Future Work
This work represents only the first step in using QSAT in FPGA CAD tools. Our research will next focus on speeding up the function mapping solver. We hope to use the Quantor QSAT solver in our PLB evaluation process. Initial numbers suggest that a dramatic speedup can be expected for large PLBs with 7 inputs or more.
