Abstract-Two regular circuit structures based on the programmable logic array (PLA) are proposed. They provide alternatives to the widely used standard-cell structure and have better predictability and simpler design methodologies. A whirlpool PLA is a cyclic four-level structure with a compact layout. Doppio-ESPRESSO, a four-level logic minimization algorithm, is developed for the synthesis of whirlpool PLAs. A river PLA is a stack of multiple-output PLAs that uses river routing to interconnect adjacent PLAs. A synthesis algorithm for river PLAs uses multilevel logic synthesis, simulated annealing, and ESPRESSO, targeting a combination of minimal area and delay.
I. INTRODUCTION
Regularity is a feature which can provide better guarantees that the layout designed by a computer-aided design (CAD) tool is replicated in fabrication. As the limits of the mask-making system (finite aperture of projection) are reached with smaller geometries, the actual layout patterns on the wafer differ from those produced by a CAD tool [1]. Although predistortion can be added to offset some of the real distortions [17], the number of layout patterns generated by a conventional design flow can make this task take an unreasonable amount of time and generate an enormous data set. Beyond what optimal predistortion could do, regular fabrics can reduce variations further. Also, whenever a technology is migrated to smaller feature sizes, the entire standard-cell library needs to be rebuilt; reuse of the old library is almost impossible because factors like cell speed, power, etc., do not scale. Regular circuit structures, or noncell-based structures, need much less library-rebuilding effort. A third motivation for using regularity is the timing closure problem, which arises because the design flow is sequential; early steps need to predict what the later steps will do. Inaccurate prediction leads to wrong decisions which can only be discovered later, making design iteration necessary. Preventing such iterations is difficult, but the use of regular structures can make estimation much more accurate.
Currently known regular structures include memory or lookup table (LUT)-based structures [9] and programmable logic arrays (PLAs). A PLA is composed of regular patterns, and its area and delay are directly related to the logic function it represents. The result of a sum-of-products (SOP) minimization can be mapped directly to a PLA [2], [3]. Technology mapping, required in standard-cell designs, is not necessary, nor are placement and routing necessary for a single-PLA circuit. The PLA structure is also library free. For more complex logic functions, a multilevel structure is needed [4], [5]. In multilevel logic minimization, the entire circuit is represented as a network of nodes where each node is an SOP logic function. A common approach is to optimize both the entire network and each node and then transform the circuit to a network of library cells via technology mapping. A natural generalization is to build a network of PLAs (NPLA) from the minimized network without technology mapping [6]. Although some desirable features of single PLAs, such as technology-mapping-free synthesis, are preserved, NPLAs require block-level placement and routing, which are not as well developed as similar gate-level algorithms. Two structures are presented, both of which maintain the regularity of the single PLA and eliminate the placement and routing irregularity of the NPLA: whirlpool PLA (WPLA) and river PLA (RPLA). These structures share the following advantages: 1) regularity; 2) well-buffered inputs and outputs; 3) use of fewer metal layers as compared to standard-cell and NPLA implementations; 4) no general placement and routing; these are done during logic synthesis. WPLAs and RPLAs differ in the sizes of circuits they can implement efficiently; WPLAs can handle about 1 K-gate circuits, and RPLAs can handle up to 10 K-gate circuits. For future deep submicron (DSM) designs, regularity becomes a key issue [7], making such regular structures increasingly attractive.
For larger circuits, or even whole chips, multiple WPLAs and/or RPLAs will be required. Then, block-level placement and routing are needed to build modules out of multiple WPLA and RPLA circuits. To maintain regularity (and predictability) at a global level, a regular global placement and routing scheme with consideration for buffer insertion has been proposed [14].
The paper is organized as follows. In Section II, some PLA preliminaries are introduced. WPLAs and RPLAs are discussed in Sections III and IV, respectively. Experimental results are reported in Section V, and Section VI concludes.
II. PROGRAMMABLE LOGIC ARRAYS
The building block of PLA circuits is the NOR programmable array, as illustrated in Fig. 1. A basic dynamic configuration is depicted. Besides the signal inputs, there is a "start" (active-low) signal determining whether the NOR is evaluating or precharging [8]. Each dot in the array is either a connected transistor controlled by the vertical line or a void transistor with no logic effect. The horizontal lines in the array create the wired-NOR function of the controlling signals. The input buffers produce both the inverted and the noninverted input signals. The working periods of the buffers and the precharging switches are nonoverlapping, controlled by the "start" signal. In addition to the normal horizontal lines for outputs, a delay-tracking line produces a "done" signal (also active-low), indicating that the evaluation is done. Therefore, dynamic NOR arrays can be cascaded, with the "done" of one array driving the "start" of the succeeding array.
0278-0070/03$17.00 © 2003 IEEE
In (4), the first term is the delay caused by the buffer, and the second term is the "turn-on" delay of the connected transistor at (v, h). DIB and dIB are the load-independent and load-dependent delays of the input buffer, and DTR and dTR are the load-independent and load-dependent delays of the transistor. L is the unified external load attached to the output, which is usually the input capacitance of the buffer in the next array and the capacitance of the precharging circuit. Again, the delay of a NOR array is totally determined by its embedded logic, that is, by the Tr(v, h) values.
To implement an SOP, a PLA is built. A common configuration is NOT-NOR-NOR-NOT. The first half of the PLA, NOT-NOR, is called the AND-plane, and its output signals are called products. The second half, NOR-NOT, is the OR-plane, and its output signals are called sums. Synthesis of PLAs is well studied, and efficient algorithms exist [2], [3]. The area and delay of a dynamic PLA can be easily derived from (1)-(4).
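The NOT-NOR-NOR-NOT configuration can be illustrated with a small behavioral model. The following is only a sketch of the logic function, not of the dynamic circuit (precharge, "start", and "done" are not modeled); the names `nor_array`, `pla`, `AND_ROWS`, and `OR_ROWS` are illustrative assumptions and do not come from the paper.

```python
from typing import Dict, List, Tuple

# A literal is (signal_name, polarity); polarity True selects the noninverted line.
Literal = Tuple[str, bool]


def nor_array(rows: List[List[Literal]], signals: Dict[str, bool]) -> List[bool]:
    """Each row is the wired-NOR of the literals that have a connected transistor."""
    outputs = []
    for row in rows:
        # A connected transistor discharges the precharged row when its literal is 1.
        discharged = any(signals[name] == polarity for name, polarity in row)
        outputs.append(not discharged)
    return outputs


def pla(and_rows: List[List[Literal]], or_rows: List[List[int]],
        inputs: Dict[str, bool]) -> List[bool]:
    """Behavioral NOT-NOR-NOR-NOT evaluation of an SOP.

    and_rows[k] lists the literals of product k; or_rows[m] lists the
    indices of the products summed by output m.
    """
    # AND-plane (NOT-NOR): a product is the NOR of its complemented literals.
    complemented = [[(name, not pol) for name, pol in row] for row in and_rows]
    products = nor_array(complemented, inputs)
    product_signals = {f"p{k}": value for k, value in enumerate(products)}
    # OR-plane (NOR-NOT): NOR the selected products, then invert at the output.
    nor_out = nor_array([[(f"p{k}", True) for k in row] for row in or_rows],
                        product_signals)
    return [not value for value in nor_out]


# Example SOP (an assumption for illustration): f = a*b + c'
AND_ROWS = [[("a", True), ("b", True)], [("c", False)]]
OR_ROWS = [[0, 1]]
```

The complementing of literals in the AND-plane is simply De Morgan's law: a NOR of inverted inputs is the AND of the original inputs.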
III. WPLA

A. Structure
A WPLA is a four-level cyclic structure depicted in Fig. 2. The four programmable NOR arrays of a WPLA, labeled 0, 1, 2, and 3, are organized in a cycle. In each array, input signals consist of external inputs as well as outputs from the preceding array. Placing flip-flops (DFF) between arrays 3 and 0 breaks combinational loops. All four arrays of the WPLA can output signals with both polarities. Only two metal layers are required to build a WPLA. The logical view of a WPLA is illustrated in Fig. 3. A NAND (multiple-output) is composed of a NOR together with the half buffers (inverters) driving it and driven by it. Two cascaded NANDs form an SOP. The labels in the figure are defined as follows:
• Ij: the set of external input signals ij(·) of array j;
• Tj: the set of "through" signals tj(·) generated by array j and only used by the next array;
• Oj: the set of output-only signals oj(·) generated by array j;
• Bj: the set of output signals bj(·) generated by array j, which are also used by the next array.
The derivation of the area and delay of a WPLA follows (1)-(4) given in Section II. The delay of a WPLA is simply the sum of the delays of its four arrays. The four arrays do not necessarily have the same size. This may lead to some "white" space on the sides of the arrays.
B. Doppio-ESPRESSO, a Four-Level Minimization Algorithm
The basic idea of WPLA synthesis is to minimize a pair of NANDs and to iterate over different pairs until no further improvement is obtained. Here "improvement" means smaller total area. The possible pairs in the WPLA are 0-1, 1-2, and 2-3. The minimization of a pair of NANDs differs from conventional SOP minimization, since the WPLA structure can invert product terms. Without loss of generality, we discuss the NAND1-NAND2 pair as shown in Fig. 4.
The following labeling conventions extend the basic definitions given above:
• TBi = Ti ∪ Bi;
• X: the set of literals x(·), the input to the SOP;
• P: the set of products of the SOP, p(·);
• S: the set of sums or the output of the SOP, s(·).
We use the term "literal" to denote either the noninverted or inverted Boolean value of a variable. The algorithm should not change TB0 and TB2 because these are constrained by the previous and following stages, respectively. However, the external input signals can be distributed between I1 and I2, and the output signals between BO1 and O2. T1 can, and most probably will, change. In fact, besides the SOP minimization, additional optimization comes from these signal redistributions. All optimizations ignore the required polarities of the input and output signals because we can use the input and output buffers to adjust these. In each iteration, a NAND-NAND is transformed to an SOP, an SOP minimizer is applied, and the result is transformed back.
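The outer iteration over array pairs can be sketched as a simple improvement loop. This is a minimal sketch under stated assumptions: `minimize_pair` stands in for the whole NN-to-SOP, ESPRESSO, SOP-to-NN sequence and `area` for the area model of Section II; both are hypothetical callables supplied by the caller, and the accept-only-if-smaller rule is one plausible reading of "iterate until no further improvement".

```python
from typing import Callable, List, Tuple


def iterate_pairs(arrays: List[object],
                  minimize_pair: Callable[[object, object], Tuple[object, object]],
                  area: Callable[[List[object]], float]) -> List[object]:
    """Re-minimize the pairs 0-1, 1-2, 2-3 until the total area stops shrinking."""
    pairs = [(0, 1), (1, 2), (2, 3)]
    best = area(arrays)
    improved = True
    while improved:
        improved = False
        for i, j in pairs:
            candidate = list(arrays)
            # Hypothetical pair minimizer: returns the two rewritten arrays.
            candidate[i], candidate[j] = minimize_pair(arrays[i], arrays[j])
            new_area = area(candidate)
            if new_area < best:  # keep the result only if the total area shrinks
                arrays, best, improved = candidate, new_area, True
    return arrays
```

Because a candidate is accepted only when the total area strictly decreases, the loop terminates for any area measure that is bounded below.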
The first stage is the NN-to-SOP transformation. The NAND1-NAND2 pair is transformed to an SOP form. No optimization is done at this stage. The output signals, generated by the pair, form the sums
and the input signals of the pair become the input of the SOP
There are two special cases. One is that an I2 signal in the NAND-NAND form, when becoming an input of the SOP, needs to be inverted, because it is one level back. Another is that a BO1 signal, when becoming an output of the SOP, needs to be inverted, because it is one level forward.
The second stage is the ESPRESSO SOP minimization. ESPRESSO is called to perform the SOP minimization [2]. It makes no change to X and S; however, the content of P may change.
The third stage is the SOP-to-NN transformation. The optimization done by Doppio-ESPRESSO, in addition to ESPRESSO, occurs during the transformation from SOP to the NAND pair. Note that TB0 and TB2 must be kept unchanged, due to the structural restriction of the WPLA. The SOP-to-NN transformation first produces the NAND1-NAND2 pair directly from the SOP, mostly by definition.
Step (5) tries to further reduce the circuit size.
1) Collect the invariant sets TB0 and TB2. TB0 signals are directly collected from the input set X. Collect TB2 from the sums S.
2) Determine B1, O1, and O2. For the remaining sums, those composed of more than one product correspond to O2. The sums with only one product correspond to BO1, but they need inversion. This is the reverse transformation of the second special case in the NN-to-SOP transformation. Recall that we cannot apply such simplification to TB2 because, by definition, TB2 must appear at the output of NAND2. To further distinguish B1 and O1, we check whether the single product is used by only one of the sums. If so, it belongs to O1, otherwise to B1.
3) Determine T1 and I2. If a product p is an AND of two or more literals, or the AND involves any TB0 input, it belongs to T1. In the former case, p is nontrivial; thus, the corresponding t1 involves a logic operation in NAND1. In the latter case, TB0 is only available at the input of NAND1; therefore, NAND1 must be traversed. The remaining products are determined by only one external input, and they become I2.
4) Determine I2-ONLY and I1. Since I1 and I2 may not be disjoint, we define I2-ONLY as the set of signals i2(·) such that i2(·) ∉ I1. I2 signals were column singletons in the product matrix, as discussed in Step (3), such that they do not need to appear in NAND1 but become inputs to NAND2. I1 is derived from figure) . ESPRESSO can do nothing in this example; thus, the first four steps in the SOP-to-NN will return the original structure. However, it is possible to organize a, b, and i into one new NAND1 term and make use of it in NAND2. The transformation saves one TB1 term.
The main idea is to use the following identity:
in which the u(j) are untouched terms, and n is the new term
The saving comes from the deletion of trivial T1 signals, while the cost is the addition of a new T1 signal n. Since i is non-I2-ONLY, meaning that i is also an I1 signal, it is free to appear in NAND1. To maximize the reduction, we seek common patterns in NAND2 to let multiple NAND2 outputs share the same new NAND1 terms. Fig. 6 illustrates an example of recognizing common patterns in NAND2, in which rows are the outputs and columns are the input literals (positive and negative literals of the same variable occupy different columns). The stars, which represent the care bits of the literals satisfying the above requirement, only appear in the columns of TB1 (relayed TB0) and I1 ∩ I2. Then, z3 and z5 may share a common pattern, while z2 has no sharing with others. The recognition of common patterns uses a two-step algorithm. The first step recognizes a set of candidate patterns (PSET). The second step seeks a subset of PSET in a greedy way such that the reduction is maximized. Define the reduction and W1 represent the number of outputs of NAND2 and the number of inputs of NAND1, respectively. The reduction is measured in terms of the number of bits saved in the two arrays.
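Steps 2) and 3) of the SOP-to-NN transformation described earlier are essentially classification rules, and can be sketched as follows. This is a simplified illustration, not the Doppio-ESPRESSO implementation: the data layout (products as sets of literals, sums as sets of product indices) and the function name `classify_sop` are assumptions, inversions are not tracked, and every sum is assumed to contain at least one product.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Literal = Tuple[str, bool]  # (variable name, polarity)


def classify_sop(products: List[Set[Literal]],
                 sums: Dict[str, Set[int]],
                 tb0_vars: Set[str],
                 tb2_sums: Set[str]) -> Dict[str, set]:
    """Split sums into O2/O1/B1 and products into T1/I2 (Steps 2 and 3)."""
    # Number of sums that use each product.
    use_count = Counter(p for prods in sums.values() for p in prods)

    o2, o1, b1 = set(), set(), set()
    for name, prods in sums.items():
        if name in tb2_sums:            # Step 1: TB2 is invariant, leave it alone
            continue
        if len(prods) > 1:              # Step 2: a real OR of products
            o2.add(name)
        else:                           # single product: belongs to BO1
            (only,) = prods
            (o1 if use_count[only] == 1 else b1).add(name)

    t1, i2 = set(), set()
    for k, literals in enumerate(products):
        uses_tb0 = any(var in tb0_vars for var, _ in literals)
        if len(literals) >= 2 or uses_tb0:   # Step 3: must pass through NAND1
            t1.add(k)
        else:                                # single external input
            i2.add(k)
    return {"O2": o2, "O1": o1, "B1": b1, "T1": t1, "I2": i2}
```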
IV. RPLA
A. Structure
An RPLA is a stack of multiple-output PLAs; adjacent PLAs in the stack are connected via river routing, as shown in Fig. 7. River routing is special in that any two-pin connection is made using only a single layer, and different connections do not overlap. The left sides of the PLAs are aligned. External inputs enter at the bottom; primary outputs can exit from the right side of any one of the PLAs or from the top of the RPLA. When outputs are extracted at the right side, river routing is used to bring the outputs of the PLA OR-planes to the right boundary. The delay-tracking signals are relayed serially through all the arrays.
Both the area and delay of an RPLA are explicitly expressed in terms of the PLA contents using (1)-(4) given in Section II. The nonuniformity of the PLA widths results in some "white" space on their right sides. Because the PLAs are dynamic circuits, the RPLA delay is just a summation of the PLA delays. The delays of the river-routing wires are negligible since they are local connections and have no vias.
The basic RPLA structure can be extended easily to sequential configurations. Flip-flops can be placed at the top of the last PLA, with a third metal layer used to build feedback signal lines. The feedback wires can use river routing as well. We focus on combinational RPLAs in this paper.
B. Algorithm
The design flow for RPLAs contains three steps: multilevel logic minimization; node level-placement; and net ordering. An objective function in the optimization can be evaluated with good accuracy, because the area and delay are fully determined by the logic embedded in the RPLA. In contrast, the areas and delays for standard-cell designs are hard to predict during technology independent logic optimization.
1) Multilevel Logic Minimization.
The multilevel logic minimization step uses sequential interactive synthesis (SIS) [4] . The generated Boolean network consists of single-output nodes (single-output SOPs).
2) Node Level-Placement. The single-output nodes in the network are levelized. Note that a node is an SOP (in general, two logic levels) structure, but the term "level" here means the level of the node in the Boolean network, unless otherwise mentioned; the real number of logic levels can be twice the number of network levels. The number of PLAs in the RPLA is equal to the number of network levels. The nodes on the same level are clustered into a multiple-output PLA in the RPLA. After clustering, each PLA is minimized further by ESPRESSO. Node level-placement is possible for nodes having flexibility in their levels. In general, the flexibilities of different nodes are correlated because of the fanin/fanout relations. Simulated annealing is suitable for the node level-placement since the solution space is reduced by the level dependencies, and the evaluation of area and delay is straightforward. The objective function is a weighted sum of the area and delay.
At each annealing step, two issues need to be considered in computing the area and delay. One is that the content of each PLA in the RPLA is not finalized until an SOP minimization is done. When the PLAs are small, the minimizer can be called in the inner loop of the annealing. However, when the PLAs are large, doing this is too time-consuming. A tradeoff is made; if the original size of the PLA exceeds a threshold, we use the raw content of the PLA to compute area and delay, otherwise, an SOP minimization is performed. Another issue is the area of the river routing region, which affects the total area of the RPLA. The net ordering algorithm, to be discussed later, will guarantee that the thickness of a river routing region, or the number of horizontal wiring tracks between two PLAs, is always linear with the number of outputs from the OR-plane of the previous PLA. The net ordering algorithm might increase the widths of some PLAs by a small amount, but as long as the PLAs affected are not the widest in the stack, this has no effect on the total area of the RPLA.
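The node level-placement can be sketched as a standard simulated-annealing loop over node levels. This is a minimal sketch under stated assumptions: the move set (reassign one node to another legal level), the geometric cooling schedule, and all parameter defaults are illustrative choices, not taken from the paper. The `cost` callable is hypothetical; following the text, it would compute a weighted sum of area and delay and would call the SOP minimizer only for PLAs below the size threshold.

```python
import math
import random
from typing import Callable, Dict, List


def anneal_levels(levels: Dict[str, int],
                  fanins: Dict[str, List[str]],
                  num_levels: int,
                  cost: Callable[[Dict[str, int]], float],
                  steps: int = 2000,
                  t0: float = 1.0,
                  alpha: float = 0.995,
                  seed: int = 0) -> Dict[str, int]:
    """Return the best level assignment found; levels are 1..num_levels."""
    nodes = list(levels)
    if not nodes:
        return {}
    rng = random.Random(seed)
    fanouts: Dict[str, List[str]] = {n: [] for n in nodes}
    for n in nodes:
        for src in fanins.get(n, []):
            if src in fanouts:          # primary inputs are not network nodes
                fanouts[src].append(n)

    cur = dict(levels)
    cur_cost = cost(cur)
    best, best_cost = dict(cur), cur_cost
    temp = t0
    for _ in range(steps):
        n = rng.choice(nodes)
        # Level flexibility: strictly above all fanins, strictly below all fanouts.
        lo = max((cur[s] for s in fanins.get(n, []) if s in cur), default=0) + 1
        hi = min((cur[o] for o in fanouts[n]), default=num_levels + 1) - 1
        choices = [lv for lv in range(lo, hi + 1) if lv != cur[n]]
        if choices:
            old = cur[n]
            cur[n] = rng.choice(choices)
            new_cost = cost(cur)
            delta = new_cost - cur_cost
            # Metropolis rule: always accept improvements, sometimes accept uphill moves.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                cur_cost = new_cost
                if cur_cost < best_cost:
                    best, best_cost = dict(cur), cur_cost
            else:
                cur[n] = old
        temp *= alpha
    return best
```

Because each move keeps a node between its fanins and fanouts, every visited assignment is a legal levelization, which is what keeps the solution space small.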
3) Net Ordering. The sequence of nets in the RPLA is determined such that river routing between any pair of adjacent PLAs is feasible.
Let vS(n) be the starting level of net n, which is the level of the node driving the net. Let vE(n) be the ending level of net n, which is the highest level of the nodes using n as a fanin. In the RPLA, a net only has one wire, which may cross one or more AND-planes of the PLAs. If a wire starts from a primary input, or vS(n) = 0, it is simply a vertical segment reaching level vE(n). If it starts from the output of the PLA on level vS(n) > 0, it first turns into the region between PLAs vS(n) and vS(n) + 1, and then goes vertically from vS(n) + 1 through vE(n).
Although some vertical wire segments in the AND-planes traverse more than one PLA, they are split by the buffers in the PLAs. Therefore, all the connections are short. The net ordering algorithm first finds a packing of the vertical segments in the AND-planes of the PLAs. The ordering of the output signals of the PLAs follows their ordering in the AND-planes, and the output signals reach their vertical segments via river routing immediately after they leave the PLAs generating them. Hence, the thickness of the river routing region between the PLAs on levels v and v+1 is determined by the number of outputs of the PLA on level v. A minor issue is that one is added to the thickness to account for the "done" signal.
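The definitions of vS(n) and vE(n) can be made concrete with a short sketch. The dictionary-based network representation and the function names `net_extents` and `and_planes_crossed` are assumptions made for illustration; nets with no fanout inside the network (for example, nets that only drive primary outputs) are simply omitted.

```python
from typing import Dict, List, Tuple


def net_extents(levels: Dict[str, int],
                fanins: Dict[str, List[str]]) -> Dict[str, Tuple[int, int]]:
    """Map each used net to (vS, vE). Primary inputs have vS = 0."""
    v_start: Dict[str, int] = {}
    v_end: Dict[str, int] = {}
    for node, level in levels.items():
        for net in fanins.get(node, []):
            v_start[net] = levels.get(net, 0)            # level of the driving node
            v_end[net] = max(v_end.get(net, 0), level)   # highest level using the net
    return {net: (v_start[net], v_end[net]) for net in v_start}


def and_planes_crossed(v_s: int, v_e: int) -> List[int]:
    """Levels whose AND-plane the vertical segment of the net passes through."""
    return list(range(v_s + 1, v_e + 1))
```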
Following is a brief explanation of the algorithm. First, order the nets in descending order of vE(n). For nets with the same vE(n), order them in ascending order of vS(n). There is no restriction on the nets with the same vS(n) and the same vE(n), since they always go side by side. Then, from left to right, greedily fill in each vacant slot with the next available segment in the ordered list. The greedy algorithm is equivalent to the left-edge algorithm used in channel routing [16], which gives an optimum packing of the vertical wire segments in the AND-planes of the PLAs. Finally, the widths of the PLAs are derived. The width of a PLA is determined by the number of output signals, which is a constant, and the width of the AND-plane. Note that the packing algorithm may generate some empty space in the AND-plane, which causes the width of the PLA to be larger than the estimated value. As long as such PLAs are not the widest in the RPLA stack, all the computations remain the same. Experiments indicate that the case where the widest PLA has empty columns is rare.
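The greedy packing described above can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: it assumes that a net with extents (vS, vE) occupies the AND-planes on levels vS+1 through vE, so two nets may share a column when one ends at or below the level where the other starts. The function name `pack_segments` is an assumption.

```python
from typing import Dict, List, Tuple


def pack_segments(extents: Dict[str, Tuple[int, int]]) -> List[List[str]]:
    """Pack vertical segments into columns, filled left to right.

    Nets are visited in descending vE, then ascending vS; each column is
    filled top-down with the next net that fits below the previous one.
    """
    order = sorted(extents, key=lambda n: (-extents[n][1], extents[n][0]))
    placed = set()
    columns: List[List[str]] = []
    while len(placed) < len(order):
        column: List[str] = []
        free_top = None  # highest level still free in this column (None: all free)
        for net in order:
            if net in placed:
                continue
            v_s, v_e = extents[net]
            if free_top is None or v_e <= free_top:
                column.append(net)
                placed.add(net)
                free_top = v_s  # the segment occupies v_s+1..v_e; levels <= v_s stay free
        columns.append(column)
    return columns
```

For example, with the hypothetical extents `{"a": (0, 3), "b": (0, 1), "c": (1, 2), "d": (2, 3), "e": (1, 3)}`, three nets cross levels 2 and 3, so no packing can use fewer than three columns, and the sketch uses exactly three.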
V. EXPERIMENTAL RESULTS
We compare the following methods of implementation: standard cells (SCs), networks of PLAs (NPLAs) [6], RPLAs, and WPLAs. A 0.35-μm technology was used for the comparisons since a standard-cell library containing over 100 cells was available, and each logic cell has at least two choices of drive strength. Typical parameters of the standard-cell library are given in Table I. Several important parameters of the PLA designs are given in Table II. All delay-related parameters contain a load-independent part (the intrinsic delay) and a load-dependent part.
Standard-cell implementations use over-the-cell routing. Metal-1 is used for internal connections of the cells, and metal-2 and -3 are used for inter-cell connections. NPLAs use metal-1 and -2 for internal connections, and metal-3 and -4 for inter-PLA connections. RPLAs and WPLAs need only metal-1 and -2 for all routing.
Since WPLAs target smaller circuits, the experiment was divided into two parts. In the first part, 15 smaller FSM examples from the LGSynth'91 benchmark set [10] were implemented on all the structures, thus focusing on examples where WPLAs could be effective. In the second part, 15 CML examples from the same benchmark set were tested on SC, NPLA, and RPLA structures. All synthesis programs were run on the same machine: a DEC Alpha 8400 5/625.
A. Comparison of {WPLA, RPLA, NPLA, SC} on FSM-LGSynth91 Benchmark Set
After the latches were removed from each example (we do not deal with state minimization and encoding), the combinational part was optimized with SIS [4] (using script.rugged) to obtain an initial Boolean network. Then, the network level was constrained to two by using the SIS command "reduce_depth -d 2". Command "map -n1 -AFG" (minimum delay circuit that respects load limit) was used for the technology mapping of SCs; for NPLAs, we clustered all single-output nodes at the same level, and called ESPRESSO with its default settings to minimize the clustered multiple-output PLAs. The RPLA node level-placement used weights of 0.5 on both normalized area and delay. Since area and delay are of different units and orders of magnitude, they were normalized first. A simple approach for this was adopted. After the initial design is created randomly, its area and delay are measured and denoted by A0 and D0. The area and delay in each annealing step are then divided by A0 and D0, respectively.
Areas and delays are given in Table III. No placement or routing was done for SCs and NPLAs, so to be fair, we assume they have an area utilization of 85%. RPLA and WPLA areas are the actual areas since they include the river routing and white space. Note also that the comparisons are not exact, since SCs and NPLAs use more metal layers than RPLAs and WPLAs. The area utilizations (nonwhite space) of RPLAs and WPLAs are also given in the table.
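The normalization just described amounts to the following objective. The function name `make_objective` and the sample numbers in the usage note are assumptions for illustration only; the weights default to the 0.5/0.5 setting used in this experiment.

```python
from typing import Callable


def make_objective(area0: float, delay0: float,
                   w_area: float = 0.5,
                   w_delay: float = 0.5) -> Callable[[float, float], float]:
    """Weighted sum of area and delay, each normalized by the initial design (A0, D0)."""
    def objective(area: float, delay: float) -> float:
        return w_area * area / area0 + w_delay * delay / delay0
    return objective
```

With hypothetical values A0 = 1200 and D0 = 3.5, the initial design scores 1.0 by construction, and a design with twice the area at the same delay scores 1.5.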
On average, WPLAs are 47% and 10% smaller than SCs and NPLAs, respectively, but only 9% and 4% slower than SCs and NPLAs. The areas of the RPLAs are on average 17% smaller than those of SCs, and their delays are 12% worse. WPLAs are on average 30% smaller than the RPLAs, when two network levels are used, and 3% faster. Thus, when implementing small circuits, WPLAs are superior. The effect of node level-placement done for RPLAs is greatly limited in the two-level case, because most of the nodes are fixed. The run times of NPLA, WPLA, and SC are similar. RPLA synthesis is on average twice as slow, due to the simulated-annealing used in the node level-placement algorithm.
B. Comparison of {RPLA, NPLA, SC} on CML-LGSynth91 Benchmark Set
Each example started with optimizing the initial Boolean network using SIS constrained to v0 network levels. Then, the number of levels was reduced gradually. For each number of levels v, for SCs, the circuit was technology mapped twice with SIS: once with area priority (map -m 0 -AF) and once with delay priority (map -n 1 -AFG), generating SCA(v) and SCD(v). The same algorithm as in the first part of the experiment was used to synthesize the NPLAs; however, there was no area or delay priority. Two RPLAs were synthesized, one with 100% weight on area and one with 50% weights on area and delay, generating RPA(v) and RPD(v). The results are listed in Table IV. No placement and routing for SCs and NPLAs were done, but an area utilization of 85% was assumed. The real areas (including white space) and their area utilizations are listed for the RPLAs. The experimental results show that, in general, there is a delay-area tradeoff; fewer network levels (v) means faster speed but larger area. Compared with SCs (raw area, three layers of metal), RPLAs (two layers) need on average 21% more area, while NPLAs (raw area, four layers) need on average 20% more area. Although RPLAs contain on average 20% white space, if the RPLA module is to be embedded in a big chip, other circuits may utilize this space. Another interesting phenomenon is that in small circuits, as in the FSM benchmark set, SCs have larger areas than RPLAs. The reason is that during technology-independent optimization, the network level was reduced to two, which may result in many duplicated cells in the later mapping stage. In terms of delay, on average RPLAs and NPLAs are 5% and 10% slower, respectively, than SCs. Synthesis times for RPLAs are about twice that for SCs, and 50% of that for NPLAs. In the RPLA case, the synthesis time is exactly the design time, but for SCs and NPLAs, additional time is needed to complete the placement and routing.
VI. CONCLUSION
Two regular circuit structures based on PLAs were presented. The regularity provides accurate area and delay estimation during synthesis. WPLAs provide a compact layout for circuits with sizes of up to about 1 K-gates. A four (logic)-level synthesis algorithm, Doppio-ESPRESSO, takes advantage of the WPLA structure, and can usually produce smaller layouts compared to standard-cell implementations. RPLAs are multilevel structures composed of a stack of PLAs. River routing between adjacent PLAs creates local and regular interconnections. The algorithm for the synthesis of the RPLAs uses the level flexibilities of the nodes and simulated annealing to find a good combination of area and delay. RPLAs are suitable for circuits with sizes of up to about 10 K-gates and provide another regular alternative to the standard-cell structure.
For larger circuits, many WPLAs and/or RPLAs might be involved. A mixture of standard cells, WPLAs, and RPLAs on a single chip is possible. Such applications may require a mixed macro/standard-cell physical design flow. Macro-cell placement and mixed macro/standard-cell placement algorithms are available [11], [12]. To maintain global regularity, a regular block-level placement and routing scheme might be adopted [14].
Various sizing and precharging configuration techniques can be applied to the structures of WPLAs and RPLAs to get a tradeoff between power, area, and delay [13]. No specific power reductions have been implemented in the WPLAs and RPLAs, but the regularity allows easy power estimation. Noise problems are also increasingly important in DSM IC designs. A PLA-based structure with shielding power/ground lines was reported [6]. This technique can be directly adopted in the WPLAs and RPLAs. Although the extension of WPLAs and RPLAs to reconfigurable versions is straightforward, the problem becomes one of mapping given logic functions onto a fixed-size WPLA or RPLA structure. Glacier PLAs, a reconfigurable version of RPLAs, have been explored [15].
