Two new techniques for mapping circuits are proposed in this paper. The first method, called the odd-level transistor replacement (OTR) method, has a goal that is similar to that of technology mapping, but without the restriction of a fixed library size, and maps a circuit to a virtual library of complex static CMOS gates. The second technique, the Static CMOS/PTL method, uses a mix of static CMOS and pass transistor logic (PTL) to realize the circuit, and utilizes the relation between PTL and binary decision diagrams. The methods are very efficient and can handle all of the ISCAS'85 benchmark circuits in minutes. A comparison of the results with traditional technology mapping using SIS on different libraries shows an average delay reduction above 18% for OTR, and an average delay reduction above 35% for the Static CMOS/PTL method, with significant savings in the area.
to work with the traditional design methodology, so that in case a library is available, the complex gates may be implemented using either the cells in the given library or the virtual library.
One barrier to this dynamic library approach relates to the quality of the layout generator and the accuracy of timing characterization for the cells generated on the fly. An argument that is often made in favor of a static precharacterized library is that the layout and timing estimates for these cells are much more accurate.
However, it is our contention that the limitations of these libraries motivate a more serious look at dynamic cell generation. We note that as layout synthesis systems have become more mature, module generators have become capable of synthesizing layouts for arbitrary complex gates more accurately and efficiently. Several industrial layout synthesis tools such as [7, 17] have been proposed to overcome this barrier and help make the application of dynamic libraries possible. For example, C5M [7] has successfully been used to design a 400 MHz processor, and LAS [17] is a commercial tool that can be used to customize standard, complex and PTL cells and digital random logic blocks with up to 60,000 gates. It was reported in [7] that the quality of the layout generation for a circuit with complex gates is comparable to the quality of hand-crafted layout, and that the timing information of the complex gates maintains a high fidelity between pre-layout and post-layout generation. Therefore, for our application, we can expect that module generators like those described above may be used to generate layouts that are consistent with the technology mapping.
The matching and covering methods used by traditional technology mappers such as MIS [4] cannot be applied on the virtual library since the number of possible templates is far too large. A pattern/Boolean matching method, which compares the pattern/logic function of the cells in the library with the pattern/Boolean representation of the target circuit, operates by choosing a covering that is the best solution over all possible matchings. Clearly, both the quality of the results and the CPU time are related to the size of the library.
There has been little prior work that solves the problem of library-less mapping. The work in [12] traverses the circuit and chooses a succession of windows within which to perform local resynthesis operations with sizing, using a sequence of locally optimal decisions to find a solution to the problem. Our technique, in contrast, uses dynamic programming to consider the implications of a decision on the entire circuit at the same time. Other related work includes [1, 2], which builds a graph representation for a logic expression and decomposes it into subexpressions; each subexpression maps onto a complex gate. The mapping algorithm works on the structure of the logic expression graph, and delays are accounted for using an RC model. The work in TABA [22] uses a BDD representation for the network, decomposes the BDD into parts and then finds a mapping for each part based on the BDD representations. TABA maps a set of Boolean equations onto a set of static CMOS complex gates under a given constraint on the number of serial transistors, and represents the latest results, prior to ours, for library-less mapping in the literature. Another approach [23] starts with a tree of NAND and NOR nodes for the logic expression and then performs dynamic tree covering on this tree.
Our approach to solving the library-less mapping problem avoids manipulating logic expressions, and instead, uses a simple and straightforward method that is entirely topological in nature. Moreover, we use more accurate delay models than those used in past approaches.
Pass transistor logic mapping
Pass transistor logic (PTL) has recently emerged as a significant alternative to full static CMOS since it has the capability to implement a logic function with a smaller number of transistors, smaller delay and lower power dissipation. Several pieces of work on realizing circuits in PTL have been published during the last few years (for example, [16, 21, 25, 27, 28, 34, 35]), demonstrating the viability of this technology.
However, there has been relatively little work on design automation for PTL at the technology mapping stage, and this work is among the first published techniques to incorporate pass transistor logic with technology mapping issues. A related body of work is a recent heuristic approach to logic synthesis for PTL [5, 6], which is more focused on the logic synthesis aspects and uses a simple delay model that estimates the delay using the number of transistors on a PTL chain. The work in [19] uses a sequence of heuristic steps to perform performance-driven PTL synthesis with buffers inserted to control the length of a PTL chain, using Elmore time constants to estimate the delay. The approach presented in [31-33] uses BDD's to represent the logic expression and decomposes a BDD into AND/OR and XOR/XNOR, MUX parts; it then uses static CMOS gates to realize the AND/OR parts and uses PTL for the XOR parts. Our work is differentiated from this research in that we use a more sophisticated SPICE-calibrated delay model, and use a dynamic programming approach instead of locally greedy optimizations. Moreover, we apply PTL to realize any type of complex gates, and not merely XOR gates.
Our work mixes the PTL and CMOS design styles to exploit the advantages of both. The idea of mixing PTL and CMOS is not new, and the benefit of this strategy has been proved by a fabricated chip in [30]. The basic difference between our work and theirs is the strategy used to determine the PTL segments of the circuit.
The work in [30] treats those parts in the BDD representation with an input fixed to $V_{dd}$ or ground as candidates to be implemented by CMOS. A similar goal was pursued in a recently published paper [11], where a greedy algorithm was used to create as large a PTL block as possible, relying on buffer insertion to enforce limits on the maximum number of series transistors permitted in the PTL segment. One difference, however, is that their work generates layouts for the circuits as well, a task that is not addressed in this work.
Contributions of this work
In this work, we propose a dynamic programming framework for technology mapping for a library-less environment and for PTL. Our work on library-less mapping focuses on the formation of gates, a procedure we call gate collapsing, which collapses smaller gates in a decomposed circuit into more complex gates. Our work does not address the issue of module generation for the layout of these cells.
The basic idea of gate collapsing has been used by traditional techniques for technology mapping that performed local gate collapsing through pattern matching to improve the circuit performance. The word "local" refers to the fact that the collapsed gates are constrained to belong to the available cell library in these approaches. In contrast, our approach of global gate collapsing does not tie the list of permissible gates to any specific library.
Our procedure works on a virtual library that is assumed to have all possible cell types, so that the global gate collapsing technique can have the full flexibility of finding the optimum possible combination of standard gates in a network.
The input to global gate collapsing comes from the output of technology-independent optimization, and the result of the procedure is a network where the input netlist is collapsed into an optimal set of complex gates corresponding to that decomposition. This technique can result in a solution that can be optimized for various objectives such as minimizing the circuit delay or the circuit area, or the power dissipation, etc.
For PTL mapping, our approach uses dynamic programming techniques to partition the circuit into PTL segments separated by static CMOS gates. The basic unit in a PTL circuit is a multiplexor, and there is a close relationship between the BDD representation of a circuit and its PTL implementation. Our method dynamically builds BDD's of logic functions and finds an optimal mapping, under the constraint that the number of PTL transistors in series must never exceed a user-specified number.
Our work uses accurate delay models calibrated using SPICE. An exhaustive set of SPICE simulations is performed to characterize complex gates and PTL and an accurate look-up table is constructed, listing the gate delay as a function of parameters such as the input signal transition time, the load, the transistor sizes and position of the switching transistor within the gate. An additional contribution of this work is a new technique that is employed to reduce the size of the look-up table and the corresponding memory overhead.
The organization of the paper is as follows. The first technique for gate collapsing, called the OTR method, is described in Section 2. The method is based on an observation that uses the topological properties of the circuit in collapsing complex gates in a computationally efficient manner. Next, we consider the problem of mixed static CMOS/PTL mapping in Section 3, using the relationship between PTL structures and BDD's. Experimental results are presented after each method, and the paper ends with concluding remarks in Section 4.
2 Odd-level Transistor Replacement (OTR) Method
An Example
We will now present a method for building complex gates, based on a simple topological technique that permits subcircuits with an odd number of gate levels to be collapsed into a single complex gate.
The basic idea of the OTR method is to use the pull-down/pull-up transistor structure from the gates at the previous level to replace the pull-up/pull-down transistors of the gates at the next level. To illustrate this, consider the circuit in Figure 2a consisting of gates G1 through G7. This structure has 20 transistors in all, and a transistor-level version is shown in Figure 2b. During the procedure of transforming the circuit into a complex gate, we will need to generate the intermediate gates shown in Figure 3a for temporary use. Those intermediate gates will be transformed into a normal static CMOS gate at the end of the transformation.
As shown in the figure, we will refer to the pull-down and pull-up transistors in G1 (G2) as $a_n$ and $a_p$ ($b_n$ and $b_p$) [...] (Figure 3a). For example, the pull-down blocks of G1 and G2 fan out to the pull-up transistors $p_{1,5}$ and $p_{2,5}$ in G5, respectively, and hence $a_n$ and $b_n$ are inserted in their place to create G5'. Similarly, the transistors in G3 and [...] (Figure 3b). Note that the final implementation has only 8 transistors, a transistor count reduction of 60%.
From the principle illustrated in this example, it is easy to see that if we collapse an even number of levels of gates, we will be left with an intermediate static CMOS gate, whereas if we collapse an odd number of levels, we will return to the normal CMOS complex gate structure, and therefore we call this technique the odd-level transistor replacement (OTR) method.
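The replacement rule can be illustrated with a small Python sketch. It is not the implementation used in this work: it assumes that each inverting gate is described by a series/parallel pull-down network, that the pull-up is its dual, and it uses a hypothetical three-level circuit (`g1 = NAND(a, b)`, `g2 = NAND(c, d)`, `g3 = NAND(g1, g2)`, followed by an inverter). All function and signal names are illustrative.

```python
from itertools import product

# A network is either a signal name (one transistor gated by that signal)
# or a pair (kind, children) with kind "S" (series) or "P" (parallel).

def dual(net):
    """Swap series and parallel connections (pull-down shape <-> pull-up shape)."""
    if isinstance(net, str):
        return net
    kind, kids = net
    return ("P" if kind == "S" else "S", [dual(k) for k in kids])

def substitute(net, drivers):
    """Replace each transistor gated by a signal in `drivers` with a network."""
    if isinstance(net, str):
        return drivers.get(net, net)
    kind, kids = net
    return (kind, [substitute(k, drivers) for k in kids])

def conducts(net, on):
    """True if the network conducts, given which leaf transistors are on."""
    if isinstance(net, str):
        return on[net]
    kind, kids = net
    results = [conducts(k, on) for k in kids]
    return all(results) if kind == "S" else any(results)

def count_leaves(net):
    """Number of transistors in one network (pull-up or pull-down)."""
    if isinstance(net, str):
        return 1
    return sum(count_leaves(k) for k in net[1])

def otr_collapse(pd3, level2, level1):
    """Collapse three levels of inverting gates into one pull-down network."""
    # An nMOS gated by a level-2 output is replaced by that gate's pull-up shape.
    step = substitute(pd3, {g: dual(pd) for g, pd in level2.items()})
    # A pMOS gated by a level-1 output is replaced by that gate's pull-down network.
    return substitute(step, level1)

def gate_output(pd, values):
    """Inverting static CMOS gate: the output is low when the pull-down conducts."""
    return not conducts(pd, values)

# Hypothetical example circuit (an assumption, not Figure 2 of the paper).
LEVEL1 = {"g1": ("S", ["a", "b"]), "g2": ("S", ["c", "d"])}
LEVEL2 = {"g3": ("S", ["g1", "g2"])}
PD3 = "g3"  # the level-3 gate is an inverter driven by g3
COLLAPSED = otr_collapse(PD3, LEVEL2, LEVEL1)

def original_output(values):
    """Evaluate the uncollapsed three-level circuit level by level."""
    v = dict(values)
    for level in (LEVEL1, LEVEL2):
        for name, pd in level.items():
            v[name] = gate_output(pd, v)
    return gate_output(PD3, v)

def equivalent():
    """Exhaustively compare the collapsed gate with the original circuit."""
    for bits in product([False, True], repeat=4):
        values = dict(zip("abcd", bits))
        if original_output(values) != gate_output(COLLAPSED, values):
            return False
    return True
```

In this sketch the collapsed pull-down network is two series pairs in parallel, so the complex gate needs 8 transistors (pull-down plus its dual pull-up) instead of the 14 used by the four original gates.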
Proof of Logic Correctness
Before beginning this proof, it is important to state that the OTR technique works when the network is entirely specified in terms of inverting gates, as is the case in any CMOS implementation. When the circuit is specified in terms of noninverting gates, the first step would be to convert all noninverting gates into an inverting gate followed by an inverter, and then apply the OTR method.
Theorem:
(1) On completion of the OTR procedure, all gates are transformed into complex traditional static CMOS gates.
(2) The OTR method preserves the logic function $f(x_1, x_2, \ldots, x_n)$ of the original circuit.
Proof: The first part of this proof is easy to see, since by construction, the pull-up of the final gate consists purely of pMOS transistors and the n part consists purely of nMOS transistors, and each pMOS structure in the pull-up will have a dual nMOS structure in the pull-down. Therefore, the final result will be a traditional static CMOS gate.
We prove the second result on the reduction of a three-level subcircuit to one level. The proof for other odd numbers of levels $l$ can be deduced from this proof in a constructive manner by applying this procedure to reduce the number of levels in steps to $l-2, l-4, \ldots, 1$.
For the circuit configuration shown in Figure 4, we label the levels from 1 to 3 as shown. We will consider the situation when the output of level 3 is at logic 1; the proof for the logic 0 case is analogous. Since the level 3 output is at logic 1, it implies that there is a set of pMOS transistors that provides at least one pull-up path between $V_{dd}$ and the output node. We will show that under the OTR scheme, the new gate will also have a pull-up path corresponding to each of these pull-up paths.
Without loss of generality, we may consider any one of these paths, $P$. Before proceeding, we note that in the original circuit, each of the inputs $g_1, g_2, \ldots, g_k$ of $P$ is connected to a pull-down path in the previous level that is connected to ground, i.e., $g_1 \wedge g_2 \wedge \cdots \wedge g_k = 0$.
Each transistor on P is replaced by transistor segments from the intermediate nontraditional CMOS gate by applying the OTR procedure, which replaces that transistor with the pull-down of the preceding intermediate gate.
To show that after modifications, the path $P$ continues to provide a pull-up path between $V_{dd}$ and the output node, it suffices to show that each such preceding intermediate gate has a path from ground to its output node.
We will call this Requirement .
Consider any such level 2 gate $g_i$ in the original circuit that excites a transistor on path $P$. If the level 3 output is high, then it must be true that there are one or more paths in gate $g_i$ that connect the output node to ground in the original circuit. Without loss of generality, we consider any one such path $Q_i$. Then it must be true that [...]
Delay Estimation
In this section, we describe the technique used in this work for delay calculation for complex gates, including a new method used to reduce the amount of storage for a look-up table based approach, while maintaining accuracy.
In order to calculate delay information accurately for complex gates, the rise delay and fall delay of the complex gates are characterized in the look-up table as a function of four parameters: (1) the position of the switching transistor, (2) the transistor size, (3) the input slope S, and (4) the loading capacitance C. The switching position here refers to which transistor in a series chain causes a gate to switch. For example, for the fall transition of a three-input NAND gate, it is possible that the switching may be caused by a transition at any one of the three transistors in the pull-down chain, each of which will lead to different delays. Therefore, we incorporate this information into the look-up table by parameterizing it by the location of the switching transistor. In our implementation, we assume that each transistor in a gate has the same size. This makes the task of layout easier, and compacts the size of the look-up table, although it is possible that some further timing improvements may be facilitated by allowing transistors to be sized individually. Our experimental results show that even under our implementational assumption, substantial area/performance improvements are possible. Moreover, the theoretical framework presented here can be extended to the case of nonuniform sizes.
Given a switching position and a transistor size, a traditional look-up table is a two-dimensional array of values parameterized by S and C, as shown in Figure 5a. We maintain a set of these two-dimensional tables to record this delay information. The number of such tables is given by twice the product of the total number of switching transistor positions and the number of possible transistor sizes, with the factor of two corresponding to separate tables for the rise and fall delays. This look-up table scheme requires a large amount of memory which can make [...] where $\alpha$, $\beta$, $\gamma$ and $\omega$ are constants. However, if we attempt to find a single delay equation for the entire table, the accuracy of the characterization may be poor. Therefore, we use a set of equations that capture the information embedded in a subset of the data, ensuring that the accuracy of each such fit is within a prescribed range.

The entire data can be fitted accurately to a small set of delay equations, and any data points that have an error larger than the prescribed range from the set of equations are stored as pure data. The overall structure of the storage is as shown in Figure 5b.
Our experimental results show that we can use the delay equation to represent about 75% of the delay data points by using 4-10 different sets of coefficients for each two-dimensional table within an error of 5%.
The procedure for finding the values of $\alpha$, $\beta$, $\gamma$ and $\omega$ requires a least-squares minimization of the following form:
[...] capacitance, slope and delay corresponding to the $i$th data point, respectively. This unconstrained minimization can be performed by setting the partial derivatives of $F$ with respect to each of the parameters to zero, i.e., $\partial F/\partial\alpha = 0$, $\partial F/\partial\beta = 0$, $\partial F/\partial\gamma = 0$, $\partial F/\partial\omega = 0$. This yields a system of linear equations that is solved to find the values of $\alpha$, $\beta$, $\gamma$ and $\omega$.
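The fitting step can be sketched in Python as follows. Since the delay equation itself is not recoverable from the text above, the sketch assumes, purely for illustration, a form that is linear in four coefficients, `D = alpha*S + beta*C + gamma*S*C + omega`; any other form that is linear in its coefficients would lead to the same kind of linear system. The function names `fit_delay` and `outliers` are hypothetical.

```python
def fit_delay(points):
    """Least-squares fit of D = alpha*S + beta*C + gamma*S*C + omega.

    `points` is a list of (S, C, D) samples. The assumed equation form is an
    illustration only. Setting the partial derivatives of the squared error
    to zero gives the 4x4 linear system (normal equations) solved below.
    """
    n = 4
    A = [[0.0] * n for _ in range(n)]
    b = [0.0] * n
    for S, C, D in points:
        phi = [S, C, S * C, 1.0]          # basis functions of the assumed form
        for r in range(n):
            b[r] += phi[r] * D
            for c in range(n):
                A[r][c] += phi[r] * phi[c]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = b[r] - sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / A[r][r]
    return tuple(x)  # (alpha, beta, gamma, omega)

def outliers(points, coeffs, tol):
    """Samples whose relative fitting error exceeds `tol`; these stay as raw data."""
    alpha, beta, gamma, omega = coeffs
    bad = []
    for S, C, D in points:
        pred = alpha * S + beta * C + gamma * S * C + omega
        if abs(pred - D) > tol * max(abs(D), 1e-12):
            bad.append((S, C, D))
    return bad
```

A table region would then be stored as one coefficient tuple plus the list returned by `outliers`, which mirrors the mixed equation/data storage described above.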
Outline of the Algorithm
We now present a dynamic programming based approach to solve the problem of area minimization under delay constraints. To appreciate the difficulty of this problem, we point out that technology mapping, a special case of global gate-collapsing, is known to be NP-complete for directed acyclic graph structures [15]. A technique that has been routinely and successfully used in technology mapping is to decompose a DAG into a set of trees and to perform mapping on those trees (for example, in [8, 10, 15]), with the trees being selected in such a way that they are all rooted at gates with multiple fanouts or at gates at the primary output. We persist with this approach in our work.
It is worth pointing out that an alternative class of approaches to technology mapping begins by merging each fanout-free region into a complex gate, so that the circuit consists only of multi-fanout complex gates; the technology mapping problem can then be treated as the problem of decomposition of these gates into complex gates. We do not use this method for two reasons. First, the computational complexity of decomposing the complex gate is exponential in the number of its inputs, and there are no optimal techniques for finding a decomposition of such a network. Second, since the decomposition method only handles fanout-free regions at a time, it is essentially similar to tree mapping in that respect. It is an open question as to which kind of decomposition will eventually yield the best results, and therefore, in our work, we assume that the initial circuit has already been decomposed into a 2-input NAND gate and inverter network.
The algorithm is based on dynamic programming and uses OTR combinations to generate possible complex gates within each tree. As in [8], we begin with a 2-input NAND gate and inverter decomposition of the circuit.
While other decompositions may also be used, the purpose of using a 2-input NAND gate decomposition is to increase the granularity of the initial circuit, to provide more freedom to the OTR procedure in generating arbitrary complex gates. In contrast, a coarse-grained approach will allow a smaller degree of flexibility in gate collapsing. The pseudocode for the algorithm is as follows:
Algorithm Outline
Input: Initial circuit decomposed into inverters and 2-input NAND gates.
Output: Optimum network of complex gates.
{
    levelize the circuit
    find roots
    sort roots from primary inputs to primary outputs
    for each root
        generate tree
        apply dynamic programming
            for each node in the tree from leaf nodes to the root
                find all possible collapsing solutions
                store non inferior solutions (Area, Delay)
            find optimum solution based on all generated noninferior states
}

An explanation of the pseudocode is as follows. The circuit is first levelized to find the level number for each gate, which is the maximum number of gates between the primary inputs and the gate output. Next, the procedure find roots is invoked to split the DAG circuit structure into a forest of trees. The function sort roots then arranges the roots of these trees according to their level number. The trees are processed in order of the level number of their roots, thereby ensuring that before each tree is considered, all of its fanin nodes have been processed.
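The preprocessing steps of the outline (levelize, find roots, sort roots, generate tree) can be sketched in Python as follows. The netlist representation, a dictionary mapping each gate output to the list of its fanin signals, and all function names are assumptions made for illustration; the small example circuit is hypothetical.

```python
def levelize(fanins, primary_inputs):
    """Level of a gate: maximum number of gates from a primary input to its output."""
    level = {pi: 0 for pi in primary_inputs}

    def visit(g):
        if g not in level:
            level[g] = 1 + max(visit(f) for f in fanins[g])
        return level[g]

    for g in fanins:
        visit(g)
    return level

def find_roots(fanins, primary_outputs):
    """Roots are gates with multiple fanouts and gates at the primary outputs."""
    fanout = {}
    for ins in fanins.values():
        for f in ins:
            fanout[f] = fanout.get(f, 0) + 1
    roots = {g for g in fanins if fanout.get(g, 0) > 1}
    roots.update(primary_outputs)
    return roots

def sort_roots(roots, level):
    """Order roots from primary inputs to primary outputs (ties broken by name)."""
    return sorted(roots, key=lambda g: (level[g], g))

def tree_of(root, fanins, roots):
    """Gates of the tree rooted at `root`; stops at other roots and primary inputs."""
    members, stack = [], [root]
    while stack:
        g = stack.pop()
        members.append(g)
        for f in fanins[g]:
            if f in fanins and f not in roots:
                stack.append(f)
    return members

# Hypothetical netlist: n1 fans out to n2 and n3, which both feed the output n4.
EXAMPLE = {"n1": ["a", "b"], "n2": ["n1", "c"], "n3": ["n1", "d"], "n4": ["n2", "n3"]}
```

For `EXAMPLE`, the multi-fanout gate `n1` and the primary output `n4` become roots, and `n1` is processed first because its level is lower, so the tree rooted at `n4` sees an already processed fanin.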
The dynamic programming procedure [9] proceeds by associating a set of states with each node, where a node corresponds to a gate output. A state is a partial solution that relates to a possible configuration of collapsed gates for the subtree rooted at that node. The state information for each node is a pair (Area, Delay), calculated from the primary inputs up to that node. The method can easily be extended to consider measures such as power in this framework. The Area at a node g is given by the sum of the Area of a candidate complex gate with output g, and the node Area for all possible states at the fanin nodes of the current complex gate. The complex gates are chosen so that the number of series-connected MOSFETs on a path to $V_{dd}$ or ground does not exceed a user-specified number k.
The algorithm consists of two phases. Phase one is a postorder traversal from the leaves to the root of the tree, during which dynamic programming proceeds by enumerating the possible states at a node, and eliminating all states in a partial solution that are provably suboptimal. For example, a state (Area, Delay) is provably inferior if there exists another state (Area', Delay') such that Area ≥ Area' and Delay ≥ Delay'. The pruned list of possible states at each node is used as the set of candidate states at the next node, and so on. Under this basic framework, the dynamic programming procedure stores only the noninferior (Area, Delay) combinations and proceeds in a manner that is fundamentally similar to that in [8]. Due to limitations of space, we do not describe any further details here.
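The pruning of inferior states can be sketched in Python as follows. The `prune` function follows the dominance rule stated above; the `combine` function additionally assumes, for illustration only, that the delay of a candidate gate is its own delay added to the latest fanin arrival time, which is a simplification of the load-dependent delay model used in this work. Both function names are hypothetical.

```python
from itertools import product

def prune(states):
    """Keep only the noninferior (area, delay) pairs.

    After sorting by area (then delay), a state survives only if its delay is
    strictly smaller than that of every state with smaller or equal area.
    """
    kept = []
    best_delay = float("inf")
    for area, delay in sorted(set(states)):
        if delay < best_delay:
            kept.append((area, delay))
            best_delay = delay
    return kept

def combine(gate_area, gate_delay, fanin_state_lists):
    """States at a node for one candidate gate, built from its fanin states.

    Area is the gate area plus the chosen fanin areas. Delay is assumed here
    to be the gate delay plus the largest chosen fanin delay.
    """
    out = []
    for choice in product(*fanin_state_lists):
        area = gate_area + sum(a for a, _ in choice)
        delay = gate_delay + max((d for _, d in choice), default=0)
        out.append((area, delay))
    return prune(out)
```

Calling `combine` once per candidate complex gate at a node and pruning the union of the results gives the list of states that is passed on to the next node in the postorder traversal.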
Phase two is a preorder traversal from the root node to the leaves, where the best solution is chosen from the set of solutions generated from phase one. When all noninferior states have been enumerated, the optimal state is chosen and the corresponding circuit configuration is determined. An outline of the computational complexity is provided after the pseudocode for the Static CMOS/PTL method described in Section 3.
The delay calculation in our work is very similar to that in [8]. When we process node g during the postorder traversal, the load of node g is unknown. Hence we assume a typical value for the load of node g and find a set of solutions. After the fanout of node g has been handled, we know the exact load for node g, and we may then perform a timing recalculation for node g according to the actual load value.

The OTR procedure fundamentally requires the presence of three levels of logic on which the transform may be applied. However, our dynamic programming procedure adapts this to consider the collapsing of two levels of gates as well. Consider the situation shown in Figure 6a, with two levels of logic. The output wire is logically equivalent to two inverters in series, and therefore, it is possible to consider two types of gate collapsing schemes, as shown in Figure 6b and c. This scheme increases the versatility of choices available for the dynamic programming procedure, and this additional flexibility can give significant improvements in the results.
Experimental Results: OTR
The OTR method described in this paper was implemented in C on a SUN Sparc 1 170 workstation. For purposes of comparison, results were generated using SIS [26] and OTR on the ISCAS'85 benchmark circuits.
The circuits were first decomposed into inverters and two-input NAND gates using SIS. Next, we performed a minimum circuit delay technology mapping in SIS for the circuits using the libraries nand-nor.genlib, mcnc.genlib, lib2.genlib and 44-6.genlib. To maintain compatibility between the delay models, we characterized the SIS library with the same set of technology parameters that we used for the circuit simulations used to generate our delay models. The value of the parameter k described in Section 2.4 was set to 4 in our work.
Our OTR results and SIS results on the nand-nor.genlib, mcnc.genlib and lib2.genlib libraries are shown together in Table 1 for various circuits. These libraries were characterized so that the area and delay measures were consistent with OTR, and the number of power levels for each cell was consistent with that for OTR, although OTR can naturally allow a wider variety of cells. The sizes of cells (power levels) in the SIS libraries were chosen based on the sizes of the transistors of the virtual cells for OTR. We chose multiple typical power levels for virtual cells and performed SPICE simulation for delays. The process of characterization for the virtual library took us about one month. We show the results of applying this technique to find the minimum delay, but the method can equally well be used to solve the constrained optimization problem. In this table, column 1 shows the circuit name; columns 2-4 show, respectively, the minimum delay, the corresponding area, and the CPU time [...] These results are indicative of the power of our technique. It is important to note that SIS simply cannot work in our new design methodology because it cannot work on a virtual library, and requires all allowable gates to be listed and characterized in the library, which could be a prohibitive overhead. Our methods are fast and the [...] Table 1.
After comparing our library-less mapping method with SIS, a library-based mapping method, we compare our method with another library-less mapping method, TABA [22]. As mentioned earlier, TABA provides the latest results for library-less mapping in the literature. Table 3 shows the results in terms of the number of gates (including inverters)/transistors using TABA and OTR on some IWLS'93 benchmark circuits. We see that the results of OTR are better than those of TABA in 6 out of 8 circuits. Table 4 shows the number of transistors (the number of gates used by TABA is not reported) using TABA and OTR on the ISCAS'85 benchmark circuits; again, we see that OTR yields better results than TABA. In the cases where our approach does worse, it can be attributed to the limitation of OTR which relies on the initial circuit decomposition.

Table 4: Number of transistors using TABA and OTR on the ISCAS'85 benchmark circuits

Circuit   TABA   OTR
C432       710    644
C499      1464   1352
C1355     1592   1304
C1908     2346   1858
C2670     2880   2842
C3540     4106   4012
C5315     5922   5542
C6288     8096   7992
C7552     8556   8298
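The relative savings implied by the Table 4 data can be computed with a short Python sketch. It assumes that the first number listed for each circuit is the TABA transistor count and the second is the OTR count, which matches the order in which the two methods are named in the text; the names `TABLE4` and `reductions` are illustrative.

```python
# Transistor counts from Table 4, assumed to be ordered as (TABA, OTR).
TABLE4 = {
    "C432": (710, 644),
    "C499": (1464, 1352),
    "C1355": (1592, 1304),
    "C1908": (2346, 1858),
    "C2670": (2880, 2842),
    "C3540": (4106, 4012),
    "C5315": (5922, 5542),
    "C6288": (8096, 7992),
    "C7552": (8556, 8298),
}

def reductions(table):
    """Fractional transistor-count reduction of OTR relative to TABA per circuit."""
    return {name: (taba - otr) / taba for name, (taba, otr) in table.items()}

if __name__ == "__main__":
    for name, frac in reductions(TABLE4).items():
        print(f"{name}: {100 * frac:.1f}% fewer transistors")
```

Under this reading of the columns, OTR uses fewer transistors on every listed ISCAS'85 circuit, with the largest relative saving on C1908.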
The results can be improved still further by enhancing the set of choices available to the dynamic programming procedure by applying the procedure illustrated in Figure 6. Table 5 shows these results: column 1 shows the circuit names, and the remaining columns show the minimum delay and area obtained with and without using this approach, and the corresponding CPU time. It is seen that this improves the results by an average of 26% and a maximum of 43% in delay, while simultaneously providing area reductions of an average of 0.5% and a maximum of 57%.

Although static CMOS has been a mainstay of circuit design for decades, with increasing performance requirements on circuits in terms of speed and power, there is a conscious attempt to seek design styles with better performance. Several techniques such as dynamic logic and PTL have been proposed recently. In this section, we develop techniques for the synthesis of circuits with a combination of static CMOS and PTL and present a procedure that partitions a circuit into static CMOS and PTL to achieve the minimum delay.
PTL is widely considered to be a promising design style since it can implement most functions using fewer transistors than a static CMOS implementation. This reduces the overall capacitance, resulting in circuits with higher speed and lower power dissipation. The logic style is illustrated in Figure 7, which shows a PTL logic segment that realizes the two-input AND function. Only recently has PTL become noticed as a viable design style in its own right, and consequently there are no mature synthesis tools to realize the advantages of this logic style.
In dealing with PTL, a designer must be aware of the following limitations:
(1) For an nMOS (pMOS) transistor, the low-to-high (high-to-low) transition is imperfect and therefore PTL cannot achieve full voltage swings, resulting in reduced noise margins.
(2) It is possible for sneak paths between $V_{dd}$ and ground to exist unless the circuit is designed carefully. An [...]

Pass transistors can be used to build a 2-input multiplexer, leading to a one-to-one correspondence between BDD's and their PTL implementations. Since a BDD can represent any logic function, we can use the BDD representation to directly arrive at a PTL implementation of a complex gate. In Figure 9, we show the correspondence between a BDD node and a pass transistor, build the BDD representation for the 2-input AND gate, and arrive at the pass transistor implementation of the BDD. Figure 9a shows a BDD node whose PTL implementation is shown in Figure 9b. Using this as a basis for design, we take the BDD in Figure 9c, representing a two-input AND gate, and build the corresponding PTL implementation as shown in Figure 9d. A second example of more complex logic is shown in Figure 10.

In order to implement a BDD node using a 2-input multiplexer-like pass transistor, a suitable choice of the fundamental pass transistor cell must be made. As pointed out in [6], there are two possible types of fundamental pass-transistor units, as shown in Figure 11a and b. The first uses a pair of nMOS pass transistors, while the other utilizes an nMOS transistor and a pMOS transistor. While the worst case noise immunity of the first configuration is better than that of the second, it requires the generation of complementary signals at the gate inputs, which results in an extra area overhead. Moreover, the extra delay in generating the complement could lead to a sneak path, which could result in a larger power dissipation. In this work, we choose a fundamental cell with one nMOS and one pMOS transistor; however, the work can be extended to handle the other configuration too.
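The correspondence between BDD nodes and multiplexer-like pass transistor cells can be illustrated with the following Python sketch. It is not the BDD package used in this work: it builds a small decision diagram by Shannon expansion along a fixed variable order, represents a node as a tuple `(var, low, high)`, and assumes two pass transistors (one nMOS and one pMOS) per node, as in the fundamental cell chosen above. All names are illustrative.

```python
from itertools import product

# A BDD is the terminal 0 or 1, or a node (var, low, high).

def build_bdd(func, order, env=None):
    """Shannon-expand `func` along `order`, dropping nodes with equal branches."""
    env = env or {}
    if not order:
        return int(bool(func(**env)))
    var, rest = order[0], order[1:]
    low = build_bdd(func, rest, {**env, var: 0})
    high = build_bdd(func, rest, {**env, var: 1})
    return low if low == high else (var, low, high)

def evaluate(bdd, values):
    """Each node behaves as a 2-input multiplexer selected by its variable."""
    while not isinstance(bdd, int):
        var, low, high = bdd
        bdd = high if values[var] else low
    return bdd

def unique_nodes(bdd, seen=None):
    """Set of distinct nodes; structurally equal subgraphs are counted once."""
    seen = set() if seen is None else seen
    if not isinstance(bdd, int) and bdd not in seen:
        seen.add(bdd)
        unique_nodes(bdd[1], seen)
        unique_nodes(bdd[2], seen)
    return seen

def ptl_transistors(bdd):
    """Pass transistor count, assuming one nMOS and one pMOS per BDD node."""
    return 2 * len(unique_nodes(bdd))

# Two-input AND with variable order (a, b).
AND2 = build_bdd(lambda a, b: a and b, ("a", "b"))

def matches(func, bdd, names):
    """Check the multiplexer network against the function on all inputs."""
    for bits in product([0, 1], repeat=len(names)):
        values = dict(zip(names, bits))
        if evaluate(bdd, values) != int(bool(func(**values))):
            return False
    return True
```

For the two-input AND, the diagram has two nodes, so this cell choice gives four pass transistors before any buffering or level restoration is considered.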
Outline of the Algorithm
The dynamic programming approach for gate collapsing proposed in Section 2.4 is adapted to develop a technique for building mixed static CMOS/PTL circuits. The basic idea is to use BDD's to represent a candidate logic function that can be implemented in PTL during dynamic programming. The implementation uses the BDD package described in [3].
In using PTL, as in the case of complex static CMOS gates, we must ensure that the number of pass transistors in series is no more than a predetermined number p. In other words, while generating BDD's, we do not permit the depth of the BDD to become larger than p, and at that point, we force the use of a static CMOS gate at the fanout. The final circuit is likely to contain pieces of pass transistor logic that are isolated from each other by static CMOS gates. Since PTL generates high and low voltages that are a threshold voltage away from $V_{dd}$ and ground, respectively, this can cause the short circuit current of the CMOS transistor to be significant.
These e ects are averted in practice by connecting a weak pull-up pull-down in a feedback loop from the output of the CMOS gate to its inputs, so that the high voltage is raised to V dd and the low v oltage lowered to ground potential.
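The bound on series-connected pass transistors can be sketched as a simple depth check. This is only an illustration under stated assumptions: BDD nodes are represented as hypothetical `(var, low, high)` tuples with integer terminals 0 and 1, and the helper names `bdd_depth` and `choose_style` are not taken from the actual implementation.

```python
def bdd_depth(node):
    """Longest chain of decision nodes from `node` down to a terminal."""
    if isinstance(node, int):  # terminal 0 or 1
        return 0
    _var, low, high = node
    return 1 + max(bdd_depth(low), bdd_depth(high))


def choose_style(node, p):
    """Allow PTL only while the BDD depth stays within p; otherwise force static CMOS."""
    return "PTL" if bdd_depth(node) <= p else "static CMOS"


# 2-input AND: depth 2.
and2 = ("a", 0, ("b", 0, 1))

# 5-input AND built as a chain: depth 5.
and5 = 1
for v in reversed("abcde"):
    and5 = (v, 0, and5)

print(choose_style(and2, p=4))
print(choose_style(and5, p=4))
```

With p = 4, the 2-input AND stays within the bound, while the 5-input chain exceeds it and would be cut by a static CMOS gate.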
The dynamic programming approach is used here to determine how the circuit should be partitioned between static CMOS and PTL implementations, with OTR-based gate collapsing being used for the static CMOS segment.
In our current implementation, we use a look-up table method for the PTL delay model that is similar to that used for the static CMOS logic, but with different values of the coefficients in the delay equation mentioned in Section 2.3.
As in Section 2.4, and for the same reasons, the algorithm begins by decomposing the circuit into a forest of trees. For each such tree, we perform mapping by dynamic programming in a manner similar to that described in Section 2.4 to implement the design using mixed static CMOS/PTL logic. Since the threshold value p is a small number, it is computationally inexpensive to generate BDD's in the fanin cone up to a BDD depth of p.
Moreover, the number of possibilities for mapping a node either into a complex gate with a bounded number, k, of series-connected MOSFETs, or as PTL with a bounded p value, is finite and small. Therefore, the computation is fast. The dynamic programming approach is guaranteed to find the optimal solution of mixed static CMOS/PTL circuits for tree structures. For DAG structures, since the approach uses techniques that have worked well for technology mapping, we expect the results to be near-optimal for this problem too, and as our experimental results show, the procedure leads to sensible designs.
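The decomposition into a forest of trees can be sketched as follows. This is a hedged illustration that assumes the convention common in tree-based technology mapping, where primary outputs and multi-fanout gates become tree roots; the details of Section 2.4 may differ. The netlist format (`fanins` as a dictionary from each gate to its fanin names) and the function name `find_tree_roots` are hypothetical.

```python
from collections import Counter


def find_tree_roots(fanins, primary_outputs):
    """Return the sorted tree roots: primary outputs plus gates with fanout > 1.

    fanins: dict mapping each gate name to a list of its fanin names
            (other gates or primary inputs).
    """
    fanout = Counter(src for ins in fanins.values() for src in ins)
    roots = set(primary_outputs)
    roots |= {gate for gate in fanins if fanout[gate] > 1}
    return sorted(roots)


# Small example: g1 feeds both g2 and g3, so the DAG is cut at g1.
netlist = {
    "g1": ["a", "b"],
    "g2": ["g1", "c"],
    "g3": ["g1", "d"],
    "g4": ["g2", "g3"],
}
print(find_tree_roots(netlist, primary_outputs=["g4"]))
```

Each root then heads a tree whose leaves are primary inputs or other roots, and each tree can be mapped independently by dynamic programming.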
The dynamic programming procedure is similar to the OTR algorithm described earlier and is represented by the following pseudocode:
Algorithm Outline
Input: Initial circuit decomposed into inverters and 2-input NAND gates.
Output: Optimum mixed PTL/static CMOS gate network.

{
    levelize the circuit
    find roots
    sort roots from primary inputs to primary outputs
    for each root
        generate tree
        for each node in the tree from leaves to the root
            apply dynamic programming procedure
                find maximum fanin cone
                generate all possible BDDs inside the maximum fanin cone to generate PTL solutions
                find all possible collapsing solutions
                store noninferior solutions (Area, Delay)
    find optimum solution of the primary outputs
}

The chief difference between the OTR approach and this approach is that we maintain BDD representations for all possible candidate PTL implementations; as mentioned earlier, due to the limitation on the number of series-connected PTL transistors, these BDD's operate within a maximum fanin cone and are typically small. We compute the possible states of a node g, either as a complex static CMOS gate or as a PTL implementation using the BDD representation, and calculate the Area and Delay for every candidate state. As before, each state corresponds to an (Area, Delay) combination, and only the noninferior states are stored. Finally, when all noninferior states have been enumerated, the optimal state is chosen and the corresponding circuit configuration is determined.
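The step that stores only noninferior (Area, Delay) states can be sketched in Python. The function names and the simple additive cost model in `combine` (gate area plus fanin areas, gate delay plus the latest fanin arrival) are assumptions made for illustration; the actual implementation uses the look-up table delay model described earlier.

```python
from itertools import product


def prune_noninferior(candidates):
    """Keep only (area, delay) pairs that no other pair beats in both area and delay."""
    kept = []
    for area, delay in sorted(set(candidates)):  # ascending area, then delay
        if not kept or delay < kept[-1][1]:
            kept.append((area, delay))
    return kept


def combine(gate_area, gate_delay, fanin_solutions):
    """Combine one candidate gate with every choice of noninferior fanin states.

    Assumed cost model: areas add up; delay is the gate delay plus the
    slowest chosen fanin state.
    """
    states = []
    for choice in product(*fanin_solutions):
        area = gate_area + sum(a for a, _ in choice)
        delay = gate_delay + max((d for _, d in choice), default=0)
        states.append((area, delay))
    return prune_noninferior(states)


print(prune_noninferior([(4, 10), (5, 8), (6, 9), (5, 12), (7, 8)]))
print(combine(2, 1, [[(4, 10), (5, 8)], [(3, 6)]]))
```

Because inferior states are discarded at every node, the list carried up the tree stays short, which is what keeps the per-node work bounded.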
While the nature of dynamic programming makes it inherently difficult to arrive at an accurate measure of the computational complexity, it is worthwhile to attempt an estimate. For both the OTR and the static CMOS/PTL methods, we need to build complex gates, either in static form or as PTL. Suppose that for each node, it is possible to build C possible complex gates, that a complex gate can have a maximum of I inputs, and that each node can have up to M (Area, Delay) pairs stored during dynamic programming. Therefore, for each node, the amount of computation for calculating the (Area, Delay) pairs is $O(C \cdot I \cdot M)$. In general, C and I are bounded, and so the computational complexity per node can be written as $O(M)$. Since the dynamic programming technique handles each of the N gates in the circuit, the computational complexity of our algorithm is $O(N \cdot M)$.
Experimental Results: PTL
The Static CMOS/PTL method described above was also implemented in C on a SUN Sparc 1 170 workstation.
Results were generated using our mixed static CMOS/PTL method on the ISCAS'85 benchmark circuits. As before, the circuits were first decomposed into inverters and two-input NAND gates. As for OTR, each PTL cell has multiple power levels. The value of the parameter p described in Section 3.3 was set to 4 in our experiments unless otherwise specified.

Table 6 illustrates the results of our Static CMOS/PTL algorithm for various circuits, as compared with SIS on the three libraries nand-nor.genlib, mcnc.genlib, and lib2.genlib. The results generated on SIS are identical to those described earlier, but are displayed again in this table for better readability. In this table, column 1 shows the circuit name; columns 2-4 show, respectively, the minimum delay, the corresponding area, and the CPU time of the SIS mapping results for each circuit on the nand-nor.genlib library. The same information is shown for the SIS mapping results on the mcnc.genlib library in columns 5-7, for the lib2.genlib library in columns 8-10, and finally, for the Static CMOS/PTL method in columns 11-13. The last line shows the average improvements in delay and area using our approach. [...] from these values. However, we expect that with the use of a high-quality module generator, the general trends should be maintained.

Table 7 shows the results of SIS on 44-6.genlib and of the OTR and Static CMOS/PTL methods. It can be seen from this table that the static CMOS/PTL design style has the best performance among these three techniques. Table 6 and Table 7 show the comparison between our library-less method and a library-based method, SIS.
We compare our library-less method with other library-less methods in Table 8 and Table 9. Table 8 shows the comparison between BDDlopt.ptl [32] and our Static CMOS/PTL method in terms of the number of gates.
For a fair comparison, we set p = 3 in our approach here, since the PTL cells used by BDDlopt.ptl have up to 3 pass transistors in series. For 11 out of 13 circuits, the results of our Static CMOS/PTL method are better than those of BDDlopt.ptl because our method does not limit the PTL parts to be only XOR/XNOR gates, and we use a dynamic programming [...] and 498 PTL gates; this is not likely to lead to a viable circuit without buffering to limit the number of PTL transistors in series. In contrast, our approach takes these requirements into account while finding the solution and consequently has a better balance of the number of CMOS cells to PTL cells.
Conclusion
We have presented a new idea of global gate collapsing for pure static CMOS designs, and of using BDD's to realize mixed technology design using a combination of static CMOS and PTL. Our goal has been to present a general technique for performing overall circuit optimization using purely topological and Boolean functional techniques for static CMOS and small BDD's for PTL. Because OTR generates complex gates according to the initial circuit structure, it does not need to perform Boolean matching or pattern matching with cells in a library.
The OTR method is fast and simple and avoids the intractable subproblems in technology mapping, such as matching and covering, by constructing complex gates topologically. The use of PTL is a powerful technique to reduce the area and power dissipation of the circuit, and our work, along with the approaches in [6, 11], is among the early efforts to develop design automation support for PTL synthesis. We believe that virtual libraries and the static CMOS/PTL design style will play an increasingly important role in high-performance design. Layout issues, including cell generation and cell characterization on the fly, remain open and challenging problems. In our future work, we intend to study these problems and the problem of automated layout generation for the mapped circuits, including interconnect wires between cells.
