This paper introduces a new design approach that combines stages of logic and physical design. The logic function is synthesized and mapped to a two-dimensional array of logic cells. This array generalizes PLAs, XPLAs and cellular Maitra cascades. Each cell can be programmed as a wire, an inverter, or a two-input AND, OR or EXOR gate (with any subset of inputs negated). The gate can take as its inputs any outputs of its four neighbor cells and four neighbor buses, and sends its result back to them. This two-dimensional geometrical model is well suited to both fine-grain FPGA realization and sea-of-gates custom ASIC layout. The comprehensive design method starts from a Boolean function, specified as an SOP or ESOP, and produces a rectangular structure of (mostly) locally connected cells. Two stages, restricted factorization and column folding, are discussed in more detail to illustrate our general methodology.
INTRODUCTION
Gate arrays and standard cells are currently the most popular technologies used in ASIC design. On the other hand, the two-level Sum-of-Products (SOP) structure is widely used in Programmable Logic Devices (PLDs). For two-level logic, there are effective synthesis tools for both SOP minimization [23] and Exclusive-Sum-of-Products (ESOP) minimization [25, 28]. While the standard PLA is composed of an AND plane for product terms and an OR collecting (output) plane, the recently introduced XPLA (EXOR PLA [25]) has an AND plane for product terms and an EXOR collecting plane. Another [...] [5, 6] (now Atmel [2]), Algotronix [1] (now Xilinx), Pilkington [14], Motorola [16], Plessey, Apple, Toshiba, and National Semiconductor. Although quite different in details, these fine-grain FPGA architectures have some very specific common properties. Below we will create a generic model of a "two-dimensional logic array" that includes most of the important properties of these fine-grain FPGA architectures. Although quite simple, the model is also well suited for custom ASIC design in sea-of-gates or similar technologies.
A very practical and interesting research problem related to new programmable architectures is to find scientific evidence and experimental confirmation of the merits of the existing fine-grain architectures: How "good" are they? Can they be improved? How? To our knowledge, no research was done during the design of these architectures ([14] being the only exception) on selecting the best cell functionality, the connection patterns, the number and location of buses, etc. The architectures were created purely on the "trial and error" principle, with several modifications in subsequent chip generations and software redesigns. It is therefore very important to create new general methodologies and related prototyping software to help design new fine-grain architectures. We propose here such a methodology and related software. We will call it the "Fine-Grain FPGA Designer's Work Bench".
Our approach to creating optimal fine-grain FPGA architectures is through Device and Algorithm Co-Generation. Conventionally, the devices are designed first. Next, the optimization methods are created to support the synthesis and mapping for these devices.
When the design of an FPGA architecture is completed with no consideration of future physical design problems, the software tool design may become unnecessarily complex at the later stages. If the existing algorithms were evaluated on prototype architectures, and the corresponding improved algorithms were created concurrently with the devices, the creation of high-performance tools would be significantly easier. The tools should also be able to better utilize all the distinct properties of the devices.
The best way to deal with circuits of high complexity is to preserve their regularity as much as possible. Logic synthesis and technology mapping are still performed separately (with the recent exception of combining technology mapping with placement [7]). However, a good logic synthesis result does not necessarily guarantee a good result of technology and physical mapping, since the physical constraints are not taken into account at the stage of logic synthesis. For instance, algebraic factorization [4] is a popular method to generate multiple-level logic forms from two-level logic expressions.
However, without taking certain layout-related constraints into account, such as the limited number and connectivity of buses, a synthesis result having fewer literals may need more space for routing than another result with more literals.
In the traditional approach, where the logic optimization phase is followed by technology mapping and then placement and routing, a large number of logic cells are used for wiring connections or left entirely unused. This problem is mainly caused by not preserving local connectivity during the synthesis steps. Therefore, local buses are frequently used to complete even very short connections, which increases circuit delay. Better solutions that use different logic implementations with a larger number of logic cells but with predominantly local connections are lost during the technology mapping. The traditional technology mapping algorithms optimize area by minimizing the number of logic cells used, and circuit delay by minimizing the number of logic levels. In the "macro block" approach, which is currently used in the industry, a technology-independent multi-output representation of a Boolean function is covered with a minimum number of small standard subfunctions (macros) which have no uniform shapes and do not preserve local connectivity between macros. Consequently, the number of cells which need to be used for routing between macros is very large. On average, about 70% of the area occupied by the design in ATMEL 6000 series fine-grain FPGAs [2] is wasted if the traditional synthesis methods are used [6].
Several approaches have been proposed that use various layout constraints during logic synthesis. The first research on applying variable ordering in factorization is reported in [26]. The approach based on trees and decision diagrams (which are Directed Acyclic Graphs, DAGs) [8, 15] has also been adapted to fine-grain FPGAs [27, 30]. It makes use of the diagrams' regularity and the specific types of logic gates (AND/EXOR, MUX) used in these decision diagrams. These gates are also well suited to the existing devices from Atmel or Motorola [27, 30]. In some cases, however, when the circuit is finally mapped to a rectangular area, the triangular structure of the tree/DAG decomposition may waste a large amount of area for routing.
Therefore, we propose here a totally new approach to combined logic synthesis and physical design. Starting from the observation that the architectures have rectangular arrays of simple, locally connected cells, we create our design method especially for such arrays. The [...] In addition, similar to PLAs and gate matrix layout [29], our CMLAs can be folded in many ways. All well-known algorithms for folding and gate-matrix layout can thus be used [7, 9, 10, 11, 12, 13, 29]. However, both the properties of our general array model and the specific properties of particular commercial FPGAs call for new approaches to this folding problem [24].
The [...] [3, 17, 18, 19, 31]. In these studies, however, the connectivity patterns of cells were too restricted and the buses were mostly absent. Because of these limitations [...] The generic architecture proposed by us is shown in Figure 1. We will call it the "Generic 2-Dimensional Logic Array", or "2-D Array", for short.
The cells which are programmed (electrically configured, personalized) to logic gates will be called logic cells. A routing cell is a cell which passes a signal (wire) only. An empty cell is a cell unused in a mapping.
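The cell terminology above can be made concrete with a small sketch. It is only an illustration of the cell functions listed in the abstract (wire, inverter, two-input AND, OR, EXOR with optional input negation); the names `Kind` and `Cell`, the two negation flags, and the choice to count an inverter as a logic cell are assumptions made for this example, not definitions from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    EMPTY = "empty"   # cell unused in the mapping
    WIRE = "wire"     # routing cell: passes its first input through
    INV = "inv"       # inverter on the first input
    AND = "and"
    OR = "or"
    EXOR = "exor"


@dataclass(frozen=True)
class Cell:
    kind: Kind
    negate_a: bool = False  # negate the first input before the gate (assumed flag)
    negate_b: bool = False  # negate the second input before the gate (assumed flag)

    def is_logic(self) -> bool:
        # Assumption: an inverter counts as a logic cell, a wire does not.
        return self.kind in (Kind.INV, Kind.AND, Kind.OR, Kind.EXOR)

    def is_routing(self) -> bool:
        return self.kind is Kind.WIRE

    def evaluate(self, a: int, b: int = 0) -> int:
        """Return the cell output for 0/1 inputs a and b."""
        if self.kind is Kind.EMPTY:
            raise ValueError("an empty cell produces no output")
        a ^= int(self.negate_a)
        b ^= int(self.negate_b)
        if self.kind is Kind.WIRE:
            return a
        if self.kind is Kind.INV:
            return a ^ 1
        if self.kind is Kind.AND:
            return a & b
        if self.kind is Kind.OR:
            return a | b
        return a ^ b  # Kind.EXOR


# Example: a cell programmed to a AND (NOT b).
and_not_b = Cell(Kind.AND, negate_b=True)
```

In this sketch the neighbor and bus connections of the 2-D Array are deliberately left out; only the function of a single programmed cell is modeled.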
The [...] [21] and Universal XOR Forms [22].
3. The constrained factorization [24].
TWO DIMENSIONAL LOGIC ARRAYS 319
The classical methods seem to be too restricted for both the generic and CMLA models, but some of the algebraic ideas introduced by them still seem worthy of further investigation, and can be used to improve the efficiency of the methods proposed here. In the remainder of Section 3 we will introduce two new methods: one is based on Universal XOR Forms [22] (Section 3.1), and the other is based on restricted factorization to complex (Maitra) terms (Section 3.2). While the first (Boolean/spectral) method is more general and usually leads to better solutions because of the extremely large space of solutions it searches, the second (algebraic) method in our current implementation leads to much faster programs.
[...] form. In general, the coefficients of the orthogonal expansion for a Boolean function are obtained by multiplying the matrix of this expansion by the vector of minterms of this function. The matrix of the expansion is the inverse of the matrix of basis functions [21]. By repeating this procedure for the expansion matrices corresponding to all the bases from some family F of bases, and selecting the basis for which the number of non-zero coefficients is minimal, one obtains the exact minimal form in this family F of bases.
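The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the family F is taken to be the fixed-polarity Reed-Muller bases (one basis per polarity vector), the matrix inverse is computed over GF(2) by plain Gauss-Jordan elimination, and the function names `gf2_inverse`, `fprm_basis`, `coefficients` and `best_in_family` are hypothetical.

```python
from itertools import product


def gf2_inverse(m):
    """Invert a square 0/1 matrix over GF(2) by Gauss-Jordan elimination."""
    n = len(m)
    aug = [row[:] + [int(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            raise ValueError("matrix is singular over GF(2)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col and aug[r][col]:
                aug[r] = [x ^ y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fprm_basis(n, polarity):
    """Matrix of basis functions: entry [point][subset] is the value of the
    product of literals (x_i XOR polarity_i), i in subset, at that input point."""
    points = list(product((0, 1), repeat=n))
    subsets = list(product((0, 1), repeat=n))
    return [[int(all(x[i] ^ polarity[i] for i in range(n) if s[i]))
             for s in subsets] for x in points]


def coefficients(basis, truth):
    """Expansion coefficients: inverse of the basis matrix times the truth vector."""
    inv = gf2_inverse(basis)
    return [sum(a & b for a, b in zip(row, truth)) % 2 for row in inv]


def best_in_family(n, truth):
    """Try every polarity and keep the basis with the fewest non-zero coefficients."""
    best = None
    for pol in product((0, 1), repeat=n):
        coeffs = coefficients(fprm_basis(n, pol), truth)
        cost = sum(coeffs)
        if best is None or cost < best[0]:
            best = (cost, pol, coeffs)
    return best


# Truth vector of f = a'b' over the points (a, b) = 00, 01, 10, 11.
f = [1, 0, 0, 0]
cost, polarity, coeffs = best_in_family(2, f)
```

For this small function the positive-polarity basis needs four terms (1 ⊕ b ⊕ a ⊕ ab), while the basis with both variables complemented needs a single term. The exhaustive loop over the family is what makes the approach expensive for larger families of bases.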
The total number of UXF forms was shown to be

$$\frac{2^{(2^n-1)\,2^{n-1}} \prod_{i=1}^{2^n} \left(2^i - 1\right)}{(2^n)!}$$

3.1. Synthesis Based on UXF Forms

3.1.1. Universal XOR forms

In the vector space over GF(2) formed by the set of n-variable switching functions under addition mod-2, every switching function can be represented uniquely as a linear combination of the basis functions [22]. [...] [8]. Some UXF forms also include terms which require gates other than AND and NOT for their realization. They include various AND/OR/EXOR canonical forms [21, 22].
One well-known XOR canonical form is the Reed-Muller Canonical (RMC) form. The standard canonical sum-of-minterms form can also be considered a UXF. As an example, the monoterms of the RMC (the [...] The Reed-Muller expansion is a particular example of an orthogonal expansion and the RMC is a particular UXF [...] where n is the number of variables in the function [22].
Among all these forms, there are families of forms which have an easy circuit realization for a given fine-grain FPGA architecture. In the orthogonal plane, it is assumed that the primary inputs are carried across the levels through buses. The uxf-terms are then constructed through the allowable gates in the level. As an example, the product ac can be produced by getting a signal from the bus, passing it through the "b-cell" via a wire (a connection cell) and then ANDing a and c in the "c-cell". In a similar way, various terms composed from connection ("wire"), AND, OR, and EXOR cells can be realized in the orthogonal plane. An example is shown in Figure 3.
While the number of all UXF forms is enormously large, the constraints of the technology limit the number of basis vectors that can be utilized in a given architecture. As the rows of the arrays realize the basis elements of a given basis which have a coefficient of 1, it may not be possible to realize every possible basis element in a single row. As an example, let us assume that the array is comprised of only AND gates. Furthermore, let us [...]

FIGURE 3 An example of an orthogonal plane.

[...] Figure 7 needs two output columns. The rows are now permuted to avoid overlap of nets connected to each column. Then the two output columns can be combined into one column, as shown in Figure 8, and the total number of output columns is reduced.

Based on the above discussion, the outline of the combined factorization/folding approach is the following. [...] complex terms as shown in Figure 9. A cube (product term) B and cube C in Figure 9(a) can be reshaped to B' and C' in Figure 9([...] After the output folding, the final result is shown in Figure 10.
RESTRICTED FACTORIZATION THEORY
Since the factorization problem outlined above involves more constraints than standard factorization, and since the conventional algebraic division method [4] does not take these constraints into account, we have developed a new factorization method for this specific problem. We call this restricted factorization. The new method is based on cube calculus operations [23, 25, 28].

FIGURE 10 The final CMLA of the two-bit adder after folding.
In this section, the concepts of distance and difference of two product terms, and a cube operation, exorlink, are first introduced. Then the method to generate complex terms from product terms is discussed. The algorithm to combine product terms into complex terms is based on calculating the difference and the distance of the cubes for every pair of cubes representing product terms. This is used to decide whether two product terms can be combined into a complex term. It also determines the cases when the cubes need to be reshaped in order to increase the possibility of re-combining them. This reshaping is done using the exorlink operation.
Definitions
In positional cube notation, a literal with positive polarity (a variable with no negation) is coded as 10, a literal with negative polarity (a variable with negation) is coded as 01, and a missing literal is coded as 11. [...] Figure 11.
Definition 2: The distance of two terms is the number of variables for which the corresponding literals of these terms have different polarities. [...] Figure 11.
In Figure 11a, three arrows indicate the three pairs of literals with different values. Since the difference of the two terms is three, three resultant cubes are generated.
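The cube operations discussed here can be sketched as follows, using the two product terms of Figure 11 (abe and ab'cde). This is a hedged illustration: the reading of "difference" as the number of variables whose literal codes differ is inferred from the surrounding text, the function names are hypothetical, and only one of the possible exorlink results (one particular linking order of the variables) is generated.

```python
from itertools import product

# Positional cube notation: one 2-bit code per variable.
POS, NEG, DC = 0b10, 0b01, 0b11  # positive literal, negative literal, missing literal


def difference(a, b):
    """Number of variables whose literal codes differ (assumed reading of the text)."""
    return sum(x != y for x, y in zip(a, b))


def distance(a, b):
    """Number of variables whose literals have opposite polarities."""
    return sum({x, y} == {POS, NEG} for x, y in zip(a, b))


def exorlink(a, b):
    """Return one cube per differing variable; the XOR of the returned cubes
    equals a XOR b. Cube i takes the literals of b before position i, the
    bitwise XOR of the two codes at position i, and the literals of a after it."""
    result = []
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            result.append(tuple(b[:i]) + (x ^ y,) + tuple(a[i + 1:]))
    return result


def cube_value(cube, point):
    """Evaluate a cube at a 0/1 point: bit v of each code says whether value v is allowed."""
    return int(all((code >> v) & 1 for code, v in zip(cube, point)))


# Variable order (a, b, c, d, e).
abe = (POS, POS, DC, DC, POS)      # a b e
ab_cde = (POS, NEG, POS, POS, POS)  # a b' c d e
linked = exorlink(abe, ab_cde)
```

For these two terms the difference is 3 and the distance is 1, so `linked` holds three cubes. The identity can be checked by brute force over all 32 input points, which is cheap at this size and avoids relying on any particular printed form of the result.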
FIGURE 11 The method of calculating the exorlink of product terms abe and ab'cde.

Example 11: a b ⊕ a c are not combinable. Since the difference of these two terms is 3, performing the exorlink operation on these two terms will generate three resultant product terms. These three resultant terms cannot be combined into a complex term. Further exorlink operations can be applied to any two of the three resultant terms so that they are reshaped into other product terms. By trying all the possibilities one can prove that these product terms cannot be combined into one complex term no matter how they are reshaped.

Definition 5: Two product terms are referred to as combinable either when these two product terms are directly [...]

Based on the above discussion, the algorithm to create complex terms from product terms has been created; its pseudo-code is given in Figure 12.
Input: A minimized ESOP with pt product terms.
(1) /* record the initial result as the best result */
    best_result = initial result
(2) [...]

[...] 19 product terms as shown in Table I.
On the left of the table each row corresponds to a product term, each "1" indicates a variable, each "0" [...] b d c a). Creating desired orders is repeated until all the pairs of product terms are checked. All the desired orders are recorded. According to the algorithm from Fig. 12, the best order selected is: (d b c a e).
Based on this order, the complex terms are generated (Table II) .
For instance, row 1, a c, and row 4, a b d, can be factorized as t2 = (d b ⊕ c) a. Let us observe that the complex terms t3, t4, and t5 are reverse Maitra terms, and all other complex terms generated are forward Maitra terms. There are no bidirectional terms in this example.
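The factorization of rows 1 and 4 can be checked exhaustively. The sketch below assumes that the operators lost in the extracted text are those of the identity a c ⊕ a b d = (d b ⊕ c) a, and that a forward Maitra term is evaluated as a left-to-right cascade of two-input gates along the variable order d, b, c, a; the helper `forward_maitra` and the gate names are hypothetical.

```python
from itertools import product

# Two-input gate functions available in a cascade step (assumed gate set).
GATES = {
    "and": lambda x, y: x & y,
    "or": lambda x, y: x | y,
    "exor": lambda x, y: x ^ y,
}


def forward_maitra(first, steps, env):
    """Evaluate a cascade: start from variable `first`, then apply each
    (gate, variable) step to the running value, left to right."""
    acc = env[first]
    for gate, var in steps:
        acc = GATES[gate](acc, env[var])
    return acc


def two_product_terms(env):
    """The two ESOP rows before factorization: a c XOR a b d."""
    return (env["a"] & env["c"]) ^ (env["a"] & env["b"] & env["d"])


# t2 = ((d AND b) EXOR c) AND a, following the variable order d, b, c, a.
T2_STEPS = [("and", "b"), ("exor", "c"), ("and", "a")]

equivalent = all(
    forward_maitra("d", T2_STEPS, dict(zip("abcd", bits)))
    == two_product_terms(dict(zip("abcd", bits)))
    for bits in product((0, 1), repeat=4)
)
```

Under these assumptions the single cascade occupies one row of locally connected cells, whereas the unfactorized form needs two rows and an additional collecting EXOR.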
In this example, 19 
OUTPUT COLUMN FOLDING
In this section we present a new algorithm for the multiple column folding problem. This problem is similar to the gate matrix optimization problem, but with additional minimization objectives. Our input is a list of terms and associated output functions, called nets. The netlist obtained from the logic synthesis stage is already organized as a two-dimensional rectangular array, which preserves local connectivity.

[...] algorithm for this NP-hard [29] problem, which uses dynamic programming, is presented in [9]. Both [...] [20], a graph-theoretical approach based on interval graphs is used. The literature published later generally uses the same problem formulation but solves the problem in two different ways. Ohtsuki et al. [20] solved this problem by generating an initial solution and then improving this solution iteratively. Wing et al. [29] generate many interval graphs heuristically, and then select the best one. Huang et al. [11] followed the same direction, but in addition they also considered the layout aspect ratio. A greedy approach of assigning gates to rows one at a time was first suggested by Deo et al. in [9]. Later, Huang et al. [12] gave an algorithm that first selects nets and then selects and assigns gates according to the previously selected nets. In the paper written by Hu et al. [10] [...]

We present an algorithm which solves the column folding problem assuming the number of terms is fixed. This is analogous to the GM problem in which the number of gates is fixed. To minimize the area (the number of columns used), we try to find an optimum assignment of terms to rows of the CMLA such that the number of overlapping nets is minimized. Different nets can be put in the same column if they do not overlap. In addition, to decrease circuit delay, we minimize the number of logic cells used for routing (the GM-RCM problem). We choose to solve this problem by solving two subproblems separately.
First we find an optimal ordering of terms that minimizes the number of cells used for routing. Next, we find an assignment of nets to columns such that no nets overlap and the number of columns is minimized.
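The second subproblem can be illustrated with a short sketch. Once the row order of the terms is fixed, each net spans the interval from its first to its last row, and two nets may share a column only if their intervals do not overlap. The code below uses the classical left-edge greedy for interval packing as a stand-in; it is not claimed to be the net assignment step of the algorithm in this paper, and the function names and the small netlist are hypothetical.

```python
def net_intervals(row_order, term_nets):
    """Map each net to the (first_row, last_row) span of the terms connected to it."""
    spans = {}
    for row, term in enumerate(row_order):
        for net in term_nets[term]:
            lo, hi = spans.get(net, (row, row))
            spans[net] = (min(lo, row), max(hi, row))
    return spans


def assign_columns(spans):
    """Left-edge greedy: visit nets by increasing first row and put each one in
    the first column whose previous net ends strictly above it."""
    last_row = []    # last occupied row of each column
    placement = {}   # net -> column index
    for net, (lo, hi) in sorted(spans.items(), key=lambda kv: kv[1]):
        for col, last in enumerate(last_row):
            if last < lo:
                last_row[col] = hi
                placement[net] = col
                break
        else:
            last_row.append(hi)
            placement[net] = len(last_row) - 1
    return placement, len(last_row)


# Hypothetical netlist: t2 is a multi-net term connected to nets 1 and 2.
row_order = ["t1", "t2", "t3", "t4"]
term_nets = {"t1": [1], "t2": [1, 2], "t3": [2], "t4": [3]}
spans = net_intervals(row_order, term_nets)
placement, n_columns = assign_columns(spans)
```

Here nets 1 and 2 both touch the row of t2 and therefore need separate columns, while net 3 can be folded into the column of net 1. For intervals, this greedy uses exactly as many columns as the largest number of nets crossing any single row, so the quality of the result depends entirely on the row order found in the first subproblem.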
Definitions
Definition 6: A multi-net term is a term connected to more than one net. 
Our Approach
The key idea of our algorithm for column folding is to use a global but simple approach. In previous work [11, 12, 29], the max-net-number was used as a guide for heuristic moves. However, this number gives only a lower bound on the solution, and no information on whether and how the lower bound can be achieved. Moreover, if there is a loop in the input file, and at least one term that belongs to that loop is a max-net-term, the lower bound solution is impossible. The problem of finding these loops is also quite complicated. In addition, finding all these loops can only help to estimate the lower bound, but cannot help to find the minimum solution. When the rows are permuted, the column-length of each row, as well as the max-column-length, changes. Therefore, in our method the max-net-number is used as a lower bound, but we use column-lengths, and especially the max-column-length, as the guide for the heuristic moves.
Two important ideas are introduced in this work. The first one, which allows us to achieve very good results without exhaustive row permutations, is presented in step three, and the second one, which finds an optimum net assignment very efficiently, is presented in step five of the algorithm description. The main steps of our algorithm are presented below. [...]

Term  Net    Term  Net
t1    2      t9    2 3
t2    3      t10
t3    4      t11   3
t4    4      t12   4
t5    5      t13   4
t6    5      t14   6
t7    6      t15   8
t8    7

Net 2 in Table III shows the relation between the complex terms and the output function f2: since terms t1 and t9 occur in net 2, f2 = t1 ⊕ t9.
From the term-net netlist presented in Table III the output functions can be reconstructed. For example, since terms t1 and t9 are the only terms that belong to net 2, f2 = t1 ⊕ t9.
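Reconstructing the output functions from a term-net netlist amounts to inverting the term-to-net mapping. The sketch below assumes an EXOR collecting column, so that each output is the EXOR of the terms on its net; the function name is hypothetical and the netlist is a small illustrative fragment, not a transcription of Table III.

```python
from collections import defaultdict


def outputs_from_netlist(term_nets):
    """Group terms by net and write each output f<net> as the XOR of its terms."""
    by_net = defaultdict(list)
    for term, nets in term_nets.items():
        for net in nets:
            by_net[net].append(term)
    return {f"f{net}": " XOR ".join(terms) for net, terms in sorted(by_net.items())}


# Illustrative fragment (assumed values): t9 is a multi-net term.
fragment = {"t1": [2], "t9": [2, 3], "t2": [3]}
functions = outputs_from_netlist(fragment)
```

With this fragment, net 2 collects t1 and t9, and net 3 collects t9 and t2; terms are listed in the order in which they appear in the netlist.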
The final FPGA implementation of the example function is shown in Fig. 13 
RESULTS AND DISCUSSION
The results of the factorization method and the folding method specific to the Atmel 6000 architecture are presented in [24], so here we will concentrate only on the general [...] [12].
The decomposition of the problem into two subproblems does not influence the quality of the overall solution. The minimum solution to the first subproblem defines the lower bound for the solution of the second subproblem. As mentioned previously, a solution with a number of columns equal to the lower bound of the second subproblem, defined by the max-row-length, can always be found with our algorithm.
The MINCOL algorithm is written in C and implemented on a SPARC workstation. The preliminary results are very encouraging for solving output column folding for the fine-grain FPGA mapping problem, GM-RCM, as well as for the general Gate Matrix problem.
CONCLUSIONS AND CURRENT RESEARCH
The main technical contribution of this paper is the proposition of a comprehensive design methodology for circuits.
The methodology proposed by us is totally new and must therefore be tested on many more practical examples, together with the pre- and post-processing algorithms. Currently the most severe limitation of the method is the size of circuits that we can deal with. In particular, a fast algorithm to generate all UXF forms must be created. However, the method can be applied to parts of a circuit which was first partitioned or decomposed using general methods. It can thus be treated as a generator of large custom macro-blocks. Another area that needs further investigation is the comparison of the speed of the synthesized and device-mapped circuits with the solutions obtained by standard logic synthesis tools and mapped to fine-grain FPGAs using the respective commercial mappers. Thanks to a recent generous donation of tools by Atmel, this research is now possible for us.
Although our method is particularly tuned to Atmel and Motorola architectures, we believe that the results of this paper can also be used to create new architectures and high-performance methods for other fine-grain FPGAs. Such
