Abstract
Introduction
A conventional method of implementing FSMs is to use standard cells and a design flow (possibly iterated) of logic synthesis, technology mapping, and physical design, such as placement and routing. In deep-submicron (DSM) designs, such a design flow exacerbates the timing closure problem [6]. It has become widely accepted that regular circuit and layout structures are a means of alleviating the problem [14]. Generally, regular structured circuits, such as memory

A new regular structure is proposed and synthesis methods for it are given. It is a cyclic four-level programmable array, called a Whirlpool Programmable Logic Array (WPLA). Since this cascaded NOR structure allows binary inputs to each plane, it extends the conventional Sum-of-Products (SOP) form. An algorithm called Doppio-ESPRESSO is developed to synthesize logic into WPLAs.
Unlike ESPRESSO [2], which could be used to minimize two two-level circuits separately, Doppio-ESPRESSO uses the extra structural flexibility in WPLAs for further optimization. An important feature is that after logic minimization, the layout is completely determined. No technology mapping, placement or routing is needed for the WPLA; neither is prediction necessary, because area and delay are solely determined by the logic embedded in the WPLA. Another interesting feature is that area and delay minimization rarely conflict in this structure (primarily because 4-level logic is required). If the delay requirements are not satisfied, the user's only option is to modify the specification rather than re-run many optimizations with different synthesis parameters.
The WPLA structure is suitable for circuits with up to several thousand gates. Therefore it can be a building block on the chip. Block-level placement and routing are still needed to complete the interconnections between the WPLAs. To maintain global regularity, regular global interconnections are desired. Fortunately, both a block-level placer and a regular global wiring scheme have been reported [16, 14]. However, the question remains of how to optimally partition the circuit into pieces that can fit the size requirements of WPLAs. In this paper, we focus on the synthesis of WPLAs.
The paper is organized as follows. In Section 2, the WPLA structure is described, and area and delay computations are given. In Section 3, a synthesis algorithm for the WPLA structure is detailed.
Section 4 gives some experimental results, and Section 5 concludes.
The circuit structure of the WPLA
The WPLA structure is shown in Figure 1. Two cascaded NOR gates, together with the buffers, can be modeled as a NAND-NAND structure, shown in Figure 2, which is equivalent to a SOP. However, the inputs to the NAND gates can have both polarities available by choosing buffer polarities.

The signals of a WPLA are divided into four categories. T(.) denotes the set of signals used only internally by the next NAND. B(.) denotes the set of primary outputs that also feed the next NAND. O(.) are the primary outputs that do not fan out to the next NAND. I(.) are the primary inputs. The union of T(.) and B(.) is abbreviated as TB(.). Similar abbreviations include BO(.) and TBO(.).
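The equivalence between a NAND-NAND structure and a SOP stated above follows from De Morgan's law. As a minimal sketch, the following check exhaustively compares the two forms for one small function; the function ab + cd and the helper names are illustrative assumptions, not taken from the paper.

```python
from itertools import product


def nand(*bits):
    """NAND of any number of 0/1 inputs."""
    return 0 if all(bits) else 1


def nand_nand(a, b, c, d):
    # First level: two NAND gates; second level: one NAND over their outputs.
    return nand(nand(a, b), nand(c, d))


def sop(a, b, c, d):
    # Sum-of-products form of the same (hypothetical) function: ab + cd.
    return (a & b) | (c & d)


# Exhaustively compare the two forms over all 16 input assignments.
equivalent = all(nand_nand(*v) == sop(*v) for v in product((0, 1), repeat=4))
print(equivalent)
```

The same check can be repeated for any other pair of cascaded NAND planes by replacing the two functions.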
The width and height of a plane are: 
in which the K coefficients are determined by the technology, D_BUFI and D_BUFM are the intrinsic (load-independent) delays of the input and intermediate buffers, d_BUFI and d_BUFM are the load-dependent delays of the two kinds of buffers, L_BUFI(.) and L_BUFM(.) are the loads of the corresponding buffers, and e_N is the density of the plane, which can be derived from the bit map of the plane. The formula does not include the set-up and hold times of the latches. Although more precise delay formulations can be used, the essential point is that there are no extra elements to predict in the delay computation, given the logic implemented in the WPLA.
The delay formulation shows that reducing size can usually reduce delay, if e_N does not grow fast at the same time. This characteristic makes the design flow straightforward; synthesis algorithms only need to focus on minimizing the area since usually delay is minimized as well. This can be explained by two factors: 1) the number of logic levels is fixed, and 2) uniform buffering is used in WPLAs. In a standard-cell design, collapsing nodes on a critical path can reduce the logic levels in the hope of reducing delay, essentially introducing more parallelism. However, real gates have limited drive capacities. Increasing parallelism means larger loading; hence buffers are inserted, or driving gates themselves are duplicated. Inserting buffers may introduce additional delays; duplicating gates actually shifts the load burden backwards. In addition, such timing optimization may trigger an unexpected blow-up in area. Placement and routing factors may further complicate the problem. When a standard-cell implementation does not meet delay requirements, it is difficult to decide whether further collapsing should be done and if so, how. The WPLA synthesis approach has no such scenario.
3. Doppio-ESPRESSO, a four-level minimization algorithm
Overview
The basic idea of WPLA synthesis is to minimize a pair of NANDs, and iterate over different pairs until there is no further improvement. The possible pairs in the WPLA are 0-1, 1-2 and 2-3.

(1) The use of negative products is forbidden.
(2) The products output to the sums only.
(3) The sums only take the products as inputs.

The transformation algorithm should generate and accept the SOP form with these restrictions. In addition, the polarities of the primary inputs and outputs can be obtained by using the appropriate input and output buffers. So if the original function outputs signal Z, it is possible to end up with an optimized function of the complement of Z.
NN2SOP, the NAND-NAND to SOP transformation
We start with the simplest case, that is, I(2)=∅, BO(1)=∅ and the nand2 matrix is positive unate. Then the transformation is simply copying the nand1 matrix to the product matrix and copying the transpose of the nand2 matrix to the sum matrix.

Suppose BO(1)≠∅. This is one of the major structural differences between WPLAs and conventional PLAs. Since the SOP form can only output from the sums, the BO(1) signals have to be raised to O(2). Suppose y is a BO(1) signal. We can create a new O(2) signal, y', which is simply the complement of y, but now corresponds to an output of the sum.
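The simplest case described above (I(2)=∅, BO(1)=∅, nand2 positive unate) can be sketched in a few lines. This is only an illustration under assumed conventions: each matrix is a list of row strings over '0', '1' and '-' (don't care), nand1 rows are indexed by T(1) signals, and nand2 rows are indexed by outputs with one column per T(1) signal. The function name `nn2sop_simple` is hypothetical.

```python
def nn2sop_simple(nand1, nand2):
    """Simplest-case NAND-NAND to SOP transformation (illustrative sketch).

    nand1: rows = first-level NAND gates, columns = their inputs.
    nand2: rows = second-level NAND gates, columns = first-level outputs.
    Returns (product_matrix, sum_matrix) in a PLA-like row-per-product layout.
    """
    # The simplest case requires nand2 to be positive unate: no '0' entries.
    if any('0' in row for row in nand2):
        raise ValueError("nand2 is not positive unate; simplest case does not apply")
    # Copy nand1 to the product matrix.
    product_matrix = list(nand1)
    # Copy the transpose of nand2 to the sum matrix.
    sum_matrix = [''.join(col) for col in zip(*nand2)]
    return product_matrix, sum_matrix


# Two products over inputs (a, b, c); two outputs over the two products.
products, sums = nn2sop_simple(["11-", "-01"], ["11", "-1"])
print(products)
print(sums)
```

In the returned sum matrix, row i lists the outputs that product i feeds, which is the row-per-product layout that two-level minimizers usually expect.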
However, as mentioned above, the x_j's have to be relayed to T(1); thus their polarities should be adjusted. Now the steps of the transformation to SOP form are described.
Step 1. The inputs of the product matrix include TB(0) and I(1)∪I(2). The outputs of the sum matrix include TB(2), O(2) and BO(1)'.
Step 2. Copy nand1 to the product matrix.
Step 3. For each input signal in the product matrix, build two rows, one with a '0' in the column of that signal, and one with a '1' in the column. By doing so, all the inputs of the product matrix, including TB(0) and I(1)∪I(2)

It is obvious that not all the pseudo-T(1) signals are always utilized. But keeping them does not affect the SOP minimization. A more succinct SOP representation, with all unused pseudo-T(1) rows removed, looks like:
ESPRESSO, the SOP minimization algorithm
ESPRESSO is employed to perform the SOP minimization [2, 5]. It makes no change to the input net list and the output net list. However, the content of the product and sum matrices may change. In the example, ESPRESSO gives the following optimized SOP form.

new distributions as well. We will show that the re-distribution provides additional opportunities to optimize the logic functions. However, TB(0) and TB(2) are unchanged, because these signals are fixed, due to the structural restriction of the WPLA.
The algorithm consists of two parts. The first part, Steps 1 to 4, produces the nand1 and nand2 matrices from the SOP. These steps are mostly done by definition. Then the nand1-nand2 is further optimized using Steps 5 to 7.
Step 1. This step distinguishes B(1), O(1) and O(2). In the sum matrix, the TB(2) columns are left alone. In the remaining columns, the ones with a single 1 become BO(1), because these columns correspond to relays. The others are O(2). Shade the BO(1) columns.

Step 2. This step recognizes I(2) and T(1). In the product matrix, leave alone the BO(1) rows. Check the remaining rows. If a row has care bit(s) in TB(0) columns, or it has two or more care bits in the row, then the row is associated with a T(1) signal. The remaining rows, with a single '0' or '1' in the non-TB(0) columns, are I(2). Shade these rows, and label the corresponding columns with I(2).

Step 3. This step identifies I(2)-only signals, because some I(2) can also be I(1), if they are used by both nand1 and nand2. Check each column that has been identified as I(2). An I(2)-only signal requires that each care bit appearing in the column be a row singleton. Otherwise it is also I(1). In the example, signal e is identified as an I(2)-only signal. Shade all the I(2)-only columns in the product matrix.

Step 4. The non-shaded region in the product matrix is copied to nand1, and the transpose of the non-shaded region in the sum matrix is copied to nand2, but the I(2) care bits should be inverted. After the operation, the TB(1) columns in the nand2 matrix should contain no 0s.
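As an illustration of the column classification in Step 1 above, the following sketch labels the columns of a sum matrix. The representation (a list of row strings over '1' and '-', one row per product) and the names `classify_sum_columns` and `tb2_columns` are assumptions made for this example only.

```python
def classify_sum_columns(sum_matrix, tb2_columns):
    """Label each sum-matrix column as TB(2), BO(1) or O(2) (illustrative sketch).

    sum_matrix: list of equal-length row strings, one row per product.
    tb2_columns: set of column indices already known to be TB(2).
    """
    labels = {}
    for c in range(len(sum_matrix[0])):
        if c in tb2_columns:
            labels[c] = "TB(2)"  # these columns are left alone
            continue
        ones = sum(1 for row in sum_matrix if row[c] == '1')
        # A column with a single 1 only relays one product: it becomes BO(1).
        labels[c] = "BO(1)" if ones == 1 else "O(2)"
    return labels


# Three products, three outputs; column 0 is assumed to be a TB(2) signal.
labels = classify_sum_columns(["11-", "1-1", "--1"], tb2_columns={0})
print(labels)
```

Steps 2 and 3 follow the same pattern on the product matrix, counting care bits per row and per column respectively.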
So far, the optimization comes only from the SOP minimization.
Further optimization is possible. Suppose:
$$ z = \prod_i y_i \prod_j T_j $$

where each T_j is a single-literal function in nand1 and its literal x_j is a TB(0) signal. Then all the T_j's can be combined in nand1 using a new T(1) signal N.

Also, in the nand2 matrix, shade the I(2) columns except the I(2)-only columns. All the shaded columns in nand2 form a submatrix S.
Step 6. To maximize the reduction, we identify common patterns in the S matrix. To eliminate a column, all the care bits in the column should be covered by some selected pattern(s). Denote the number of T(1) columns eliminated by $C_T$, and the number of I(2) columns eliminated by $C_I$. Eliminating these columns will save $C_T + C_I$ columns in nand2 and $C_T$ rows in nand1, at the expense of creating $R_p$ new rows in nand1 and $R_p$ new columns in nand2. Here $R_p$ is the number of patterns used. Define gain as the total reduction in size of the two matrices:

$$ \mathit{gain} = (C_T + C_I - R_p)\,H_2 + (C_T - R_p)\,W_1 $$

where $H_2$ is the height of the nand2 matrix, and $W_1$ is the width of the nand1 matrix. To simplify the pattern recognition when different polarities may exist in the same signal, the S matrix is expressed as a pattern matrix $S_p$ as shown below.
Re-arranging rows and columns, we get:
Each column in S is split into two, one for the positive literal, and one for the negative literal. Then the care bits are replaced by *'s. The algorithm has two parts. The first collects a set of candidate patterns for the covering, and the second selects a subset that maximizes the gain. Define the size of a pattern as the product of the number of *'s in the column and the number of columns in matrix S' that are covered by this pattern. Then a seed pattern, p_0, is chosen from the candidate set, which gives the highest gain. Notice that the highest gain provided by p_0 alone might not be positive, because R_p = 1, while C_T and C_I might both be 0 at this moment. If a tie occurs, choose the larger pattern. Further ties can be broken by choosing the one with the larger number

In this example, two patterns are chosen as shown below.
Remove the columns in nand2 covered by the chosen patterns, and replace their functions with new T(1) signals. In the example, columns j2, jr, d, b and f in nand2 are removed, and m0 and m1 are created.
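The gain defined in Step 6 and the construction of the pattern matrix can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, S is assumed to be a list of row strings over '0', '1' and '-', and a '1' is assumed to denote the positive literal.

```python
def gain(c_t, c_i, r_p, h2, w1):
    """Total size reduction: (C_T + C_I - R_p) * H_2 + (C_T - R_p) * W_1."""
    return (c_t + c_i - r_p) * h2 + (c_t - r_p) * w1


def to_pattern_matrix(s):
    """Split every column of S into a positive and a negative literal column.

    Care bits become '*'; all other positions become '-'.
    """
    pattern_rows = []
    for row in s:
        cells = []
        for bit in row:
            cells.append('*' if bit == '1' else '-')  # positive-literal column
            cells.append('*' if bit == '0' else '-')  # negative-literal column
        pattern_rows.append(''.join(cells))
    return pattern_rows


# A seed pattern alone may have negative gain: R_p = 1 while C_T = C_I = 0.
seed_gain = gain(c_t=0, c_i=0, r_p=1, h2=4, w1=5)
later_gain = gain(c_t=2, c_i=1, r_p=1, h2=4, w1=5)
s_p = to_pattern_matrix(["1-", "01"])
print(seed_gain, later_gain, s_p)
```

Working on the split matrix lets a pattern search treat both polarities of a signal as independent columns, which is the simplification the text describes.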
Step 7. Check if an O(2) signal now becomes a row singleton in the nand2 matrix. For instance, O(2) signal x now has only a single care bit in the row, which means that it can be pulled back to nand1 and become an O(1). When the row is saved, it might generate empty columns in the nand2 matrix, m1 in this case. Then the m1 column can be saved. This operation concludes the SOP2NN transformation and optimization algorithm; the final nand1-nand2 matrices are shown below.

Finally we give the original logic functions:

Experimental results
We compare the following methods of implementation: standard cells (SCs), network of PLAs (NPLAs) [7], River PLAs (RPLAs) [13] and Whirlpool PLAs (WPLAs). An NPLA can be regarded as an intermediate representation between technology-independent and technology-dependent logic optimizations [4]. The RPLA is a regular structure composed of a stack of PLAs; the adjacent PLAs are connected via river routing. Logically it represents a multi-level Boolean network. In fact, a depth=2 NPLA or RPLA is logically similar to the WPLA, except that 1) WPLAs can have primary outputs directly from the product terms and 2) the product terms can appear in both polarities. The depth=1 NPLA and RPLA degrade to a single PLA. A 0.35-micron technology was used for the comparisons, since a standard-cell gate library was available for this, with over 100 gates, and each logic gate has at least two choices of drive strength. Typical parameters of the gate library are given in Table 1.

Standard-cell implementations use over-the-cell routing. Since the gates use metal-1 for internal connections, metal-2 and -3 are needed for inter-gate connections. NPLAs use metal-1 and -2 for internal connections, so the NPLA needs metal-3 and -4 for inter-PLA connections. The RPLAs only need metal-1 and -2 for routing. A WPLA uses only metal-1 and -2. Some typical parameters in the PLA designs are given in Table 2.

A set of examples was tested. After the latches are removed from each example (we do not deal with state minimization and encoding), the combinational part is optimized with SIS [1] (using script.rugged) to achieve an initial Boolean network with depth d0. Then for SC, NPLA and RPLA synthesis, we generate area/delay trade-off curves, by decreasing the depth gradually from d0 using the SIS command "reduce_depth -d d'".
At each depth d', for SC we use the SIS "map -n 1 -AFG" command (minimum delay circuit that respects load limit) for technology mapping; for NPLAs, we cluster all single-output nodes at the same level, and call ESPRESSO with its default settings to minimize the clustered multiple-output PLAs. The RPLAs are synthesized with their own algorithm [13]. WPLAs have a fixed depth of 2, so there is only one solution per example (no area-delay trade-off).

In Table 3, the numbers of programmable bits of NPLAs, RPLAs (both depth=2) and WPLAs are compared. In this case, both the NPLA and RPLA are four-level structures. However, due to the different algorithms used to synthesize them, their results are slightly different. The differences in the bit numbers show the additional improvement achieved by the Doppio-ESPRESSO algorithm; Doppio-ESPRESSO achieves on average 20% more optimization than ESPRESSO.
However, fewer programmable bits do not necessarily imply smaller areas, because PLA structures also contain components such as buffers. For WPLAs, there can be "white space" along the boundary and in the center, as illustrated in Figure 1.

Table 3. Bit counts of NPLA, RPLA (depth=2) and WPLA

Area/delay and synthesis times are given in Table 4. The "tech-indep. depth" refers to the depth during the technology-independent logic minimization. The values in the column are exact for the NPLA, RPLA and WPLA, since their logic levels will not change. However, for the SC, there is a technology-mapping step after that, and the gate levels (including buffers when calculating levels) are shown in the "SC gate level" column. No placement or routing has been done for SCs and NPLAs, so these areas are just the raw areas of the logic components.
Although we can assume that routing is done on higher metal layers, in reality, SCs may need cap cells on both sides of the rows and feed-through cells. NPLAs require block-level placement, which may generate white space. The RPLAs have their finalized layouts, which contain white space, so they give fair comparisons. In addition, the delays of the SCs and NPLAs may change after routing, due to parasitics on wires. In contrast, WPLAs consume no additional area nor have additional delay uncertainties.

Footnote: The SIS "reduce_depth -d x" command may not always reduce the depth to the designated value x, but to some value no greater than x.
For a better view of the experimental results, the area/delay data of the SCs, NPLAs and RPLAs are normalized with respect to the WPLA results and plotted in Figure 3. The (1,1) point represents the WPLA single point for all examples. Connected points for SC, NPLA or RPLA represent area/delay trade-off curves for a single example. Figure 3 shows that SCs generally have larger areas than WPLAs, but can provide smaller delays if more area is allowed. NPLAs are just the opposite; they can provide smaller (raw) areas, but usually are slower.
Comparing the depth=2 cases, on average, WPLAs are 37% and 0% smaller than SCs and NPLAs respectively, but only 5% and 3% slower than SCs and NPLAs. However, recall that the areas of SCs and NPLAs only account for raw logic components and use more metal layers. After placement, the areas of both are expected to grow, especially NPLAs. So in reality, these area/delay curves would shift to the right relative to the WPLA point. The WPLA is on average 56% smaller than the depth=2 RPLA and 13% faster than it. The RPLAs are not expected to be as useful in implementing small circuits such as those in this experiment [13], because when the circuit is small or the depth is small, the river routing region may occupy a large portion of the entire RPLA area. A rough estimate of the river routing area can be obtained from the difference between the area of the RPLA and the NPLA. Comparing the area of SC, NPLA, RPLA and WPLA with similar delays (which may have different depths), we find that the WPLA is on average 19% larger than the NPLA (raw area), 26% smaller than the SC and 32% smaller than the RPLA.
Note that some SC, NPLA and RPLA curves are not monotonically decreasing with area; thus reducing the depth may not necessarily lead to faster circuits. Other curves are unpredictable in shape, so timing closure becomes even more difficult. Thus, in addition to uncertainty caused by physical design, area/delay relations of SCs, NPLAs and RPLAs are also unpredictable, while WPLAs do not suffer from such problems.
We also found that the number of gate levels after technology mapping is non-linear in the depth of the technology-independent optimized circuit, and the relationship is not even monotonic. An interesting phenomenon is that in some circuits like "s838.1", when the depth is reduced, the actual number of gate levels increases. This can be explained by two factors. One is from the covering in the technology mapping. Suppose the classical tree covering is used, where the technology-independent optimized netlist is first transformed into a generic netlist with only nand2's and inverters. If the depth is not small, the level of the generic netlist follows the depth quite well. But when the depth is very small, the nodes in the netlist are large, and many levels of nand2's and inverters have to be used to represent them. This makes the levels of the generic netlist and thus the mapped gate netlist almost unpredictable. The other factor is the loading problem. As the depth goes down, it is conceivable that the loads (on nets between nodes, and the SOP connections within nodes) tend to increase. To obey the load limit and improve speed, appropriate buffering should be done during technology mapping, which also increases the levels of gates. This shows that even within logic synthesis, the technology-independent step has difficulty predicting the behavior of the technology-dependent step. The relationship between depth, gate levels, area and delay is complicated.
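To make the first factor concrete, the sketch below counts the gate levels of one wide AND node when it is decomposed into a balanced tree of nand2 and inverter cells. This is a simplifying assumption for illustration only; an actual mapper may choose a different decomposition, and the function name is hypothetical.

```python
def nand2_inv_levels(fan_in):
    """Gate levels of a balanced AND tree built only from nand2 and inverters."""
    if fan_in < 1:
        raise ValueError("fan_in must be at least 1")
    # Number of 2-input AND levels: ceil(log2(fan_in)), computed exactly.
    and2_levels = (fan_in - 1).bit_length()
    # Each 2-input AND is modeled as a nand2 followed by an inverter.
    return 2 * and2_levels


for k in (2, 4, 8, 16, 32):
    print(k, nand2_inv_levels(k))
```

Under this model, a single collapsed node with 32 inputs already needs 10 generic gate levels, so a small technology-independent depth does not imply few mapped levels.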
Logic synthesis times for NPLAs and WPLAs are usually smaller than for SCs, because SCs need a technology mapping stage, which becomes notably slower as the circuit size increases. The RPLA synthesis times are the slowest, due to its iterative node-placement algorithm [13].

Experimental results show that WPLAs are quite competitive, in terms of area and delay, with standard-cell implementations and network of PLA implementations, but are much more regular and predictable. They are also superior to another regular structure, the River PLA, in both area and delay, for the examples tested. A comparison between WPLAs and depth=2 NPLAs and RPLAs also shows the advantage of the Doppio-ESPRESSO algorithm in terms of the total number of programmable bits needed to build a circuit. However, some remaining problems require more discussion:
(1) The regularity of a chip involves both local and global regularity.
The WPLA provides a structure with local regularity. However, integrating multiple WPLAs on a chip and achieving global regularity is not easy.
