A standard cell library typically contains several versions of any given gate type, each of which has a different gate size. We consider the problem of choosing optimal gate sizes from the library to minimize a cost function such as total circuit area while meeting the timing constraints imposed on the circuit.
Introduction

Gate Sizing Problem
The delay of a MOS integrated circuit can be tuned by appropriately choosing the sizes of transistors in the circuit. While a combinational MOS circuit in which all transistors have the minimum size has the smallest possible area, its circuit delay may not be acceptable. It is often possible to reduce the delay of such a circuit, at the expense of increased area, by increasing the sizes of certain transistors in the circuit. The optimization problem that deals with this area-delay trade-off is known as the sizing problem.
The rationale for dealing with only combinational circuits in a world that is rampant with sequential circuits is as follows. A typical MOS digital integrated circuit consists of multiple stages of combinational logic blocks that lie between latches, clocked by system clock signals. Delay reduction must ensure that the worst-case delays of the combinational blocks are such that valid signals reach a latch in time for a transition in the signal clocking the latch. In other words, the worst-case delay of each combinational stage must be restricted to be below a certain specification.
For a combinational circuit, the sizing problem is formulated as

$$\text{minimize } \mathit{Area} \quad \text{subject to } \mathit{Delay} \le T_{spec}. \qquad (1)$$

The problem of continuous sizing, in which transistor sizes are allowed to vary continuously between a minimum size and a maximum size, has been tackled by several researchers [1-4]. The problem is most often posed as a nonlinear optimization problem, with nonlinear programming techniques used to arrive at the solution.
A related problem that has received less attention is that of discrete or library-specific sizing. In this problem, only a limited number of size choices are available for each gate. This corresponds to the scenario where a circuit designer is permitted to choose gate configurations for each gate type from within a standard cell library. This problem is essentially a combinatorial optimization problem, and has been shown to be NP-complete [5].
Chan [5] proposed a solution to the problem that was based on a branch-and-bound strategy. The algorithm is exact for tree networks. For general networks that are not tree-structured, a backtracking-based algorithm is proposed for finding a feasible solution. For general DAGs (directed acyclic graphs), a cloning procedure is used to convert the DAG into an equivalent tree, whereby a vertex of fan-out m is implicitly duplicated m times, followed by a reconciliation step in which a single size that satisfies the requirements on all of the cloned vertices is selected.
The approach of Lin et al. [6] uses a heuristic algorithm that is an adaptation of the TILOS algorithm [1] for continuous transistor sizing, with further refinements. The approach is based on a greedy algorithm that uses two measures known as sensitivity and criticality to determine which cell sizes are to be changed. The sensitivity of a cell indicates how much local delay per unit area can be decreased if we pick another template for this specific cell, while criticality tells us whether a cell has to be replaced by a larger template to fulfill the delay constraints of the circuit. A weighted sum of a cell's sensitivity and criticality is used to guide the algorithm to select a certain number of gates to be replaced with a different template.
Another algorithm, proposed by Li et al. [7], is exact for series-parallel circuits. See [7] for a formal definition of series-parallel circuits. For a chain (serial) circuit, a dynamic programming technique is used to obtain an optimal solution. For a simple parallel circuit, a number of transformations are repeated to obtain the optimal implementation. The optimal implementation of any series-parallel circuit is obtained by repeatedly using the chain and simple parallel circuit transformations on subcircuits of the given series-parallel circuit. This work is extended to non-series-parallel circuits, whose structures are represented by general DAGs, and several heuristic techniques are used in conjunction with the algorithm, but no guarantees on optimality are made for such circuits.
Both of the above two approaches [6, 7] are heuristics (Chan's approach [5] is also heuristic for general circuits), and hence no concrete statements can be made on how close their solutions are to the optimal solution. Moreover, neither work shows comparisons with a technique such as simulated annealing [8] that is known to give optimal or near-optimal solutions.
In the first part of this paper, we present a new algorithm for solving the gate sizing problem for combinational circuits that takes into consideration the variations of gate output capacitance with gate resizing. In the first phase, the gate sizing problem is formulated as a linear program. The solution of this linear program provides us with a set of gate sizes that does not necessarily belong to the set of allowable sizes. Therefore, in the second phase, we move from the linear program solution to a set of allowable gate sizes, using heuristic techniques. In the third phase, we further fine-tune the solution to guarantee that the delay constraints are satisfied. Finally, to illustrate the efficacy of our algorithm, we present a comparison of the results of this technique with the solutions obtained by simulated annealing as well as by our implementation of the algorithm in [6].
Optimization for Synchronous Sequential Circuits
Optimization of synchronous sequential circuits, on the other hand, is different. An additional degree of freedom is available to the designer in that one can set the time at which clock signals arrive at various flip-flops (FFs) in the circuit by controlling interconnect delays in the clock signal distribution network. With such adjustments, it is possible to change the delay specifications for the combinational stages of a synchronous sequential circuit to allow for better sizing. However, consideration of clock skew in conjunction with sizing increases the complexity of the problem tremendously, since it is no longer possible to decouple the problem and solve it on one subcircuit at a time.
In general, given a combinational circuit segment that lies between two flip-flops $i$ and $j$, if $s_i$ and $s_j$ are the clock arrival times at the two flip-flops, we have the following relations:

$$s_i + \mathit{Maxdelay}(i,j) + T_{setup} \le s_j + CP \qquad (2)$$

$$s_i + \mathit{Mindelay}(i,j) \ge s_j + T_{hold} \qquad (3)$$

where $\mathit{Maxdelay}(i,j)$ and $\mathit{Mindelay}(i,j)$ are, respectively, the maximum and the minimum combinational delays between the two flip-flops, and $CP$ is the clock period. Fishburn [9] studied the clock skew problem, under the assumption that the delays of the combinational segments are constant, and formulated the problem of finding the optimal clock period and the optimal skews as a linear program. The objective was to minimize $CP$, with the constraints given by the inequalities in (2) and (3) above. In real design situations, however, $CP$ is dictated by system requirements, and the real problem is to reduce the circuit area.
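As an illustration of inequalities (2) and (3), the following minimal sketch checks both relations for a single flip-flop pair. The function name and all numeric values are hypothetical and chosen only for the example; they do not come from the experiments in this paper.

```python
def check_skew_constraints(s_i, s_j, max_delay, min_delay, cp, t_setup, t_hold):
    """Check inequalities (2) and (3) for one flip-flop pair (i, j).

    Returns a pair (long_path_ok, short_path_ok).
    All argument names are illustrative assumptions.
    """
    long_path_ok = s_i + max_delay + t_setup <= s_j + cp   # inequality (2)
    short_path_ok = s_i + min_delay >= s_j + t_hold        # inequality (3)
    return long_path_ok, short_path_ok


# Hypothetical values: Maxdelay = 10, Mindelay = 3, CP = 10,
# T_setup = 1, T_hold = 0.5.
zero_skew = check_skew_constraints(0.0, 0.0, 10.0, 3.0, 10.0, 1.0, 0.5)
with_skew = check_skew_constraints(0.0, 2.0, 10.0, 3.0, 10.0, 1.0, 0.5)
print(zero_skew)   # (False, True): the long-path constraint fails
print(with_skew)   # (True, True): delaying the clock at FF j by 2 fixes it
```

In this toy case the clock period is too short for the long path under zero skew, but a skew of 2 at the receiving flip-flop satisfies both relations without changing any gate delay.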
In the second part of the paper, we examine the following problem: Given a clock period specification, how can the area of a synchronous sequential circuit be minimized by appropriately selecting a gate size for each gate in the circuit from a standard-cell library, and by adjusting the delays between the central clock and individual flip-flops? For simplicity, the analysis will use positive-edge-triggered D flip-flops. In the following, the terms flip-flop (FF) and latch will be used interchangeably. We assume that all primary inputs (PIs) and primary outputs (POs) are connected to FFs outside the system, and are clocked with zero or constant skew.
We first present an optimization algorithm for small synchronous sequential circuits. Then we consider arbitrarily large synchronous sequential circuits for which the size of the formulated problems is prohibitively large, and present a partitioning algorithm to handle such circuits. The partitioning algorithm is used to control the computational cost of the linear programs. After the partitioning procedure, we can apply the optimization algorithm to each partitioned subcircuit.

This paper is organized as follows. We describe the linear programming approach in Section 2, followed by the two post-processing phases in Sections 3 and 4. In Section 5, we formulate the synchronous sequential circuit area optimization problem and present the algorithms to tackle the problem. The partitioning algorithm that allows us to handle large circuits is presented in Section 6.
Experimental results are given in Section 7. Finally, Section 8 concludes this paper.
Problem Formulation
Formulation of Delay Constraints
The delay of a gate in a standard-cell library can be characterized by

$$\mathit{delay} = R_{out} C_{out} + \tau = \frac{R_u}{w} C_{out} + \alpha_1 w + \alpha_2 \qquad (4)$$

where $R_{out}$ is the equivalent resistance of the gate, $C_{out}$ is the load capacitance of the gate, $\tau$ is the intrinsic delay of the gate, $R_u$ represents the on-resistance of a unit transistor, and $w_i$ is called the nominal gate size of $g_i$. Therefore, the size of each gate can be parameterized by a number, $w$, referred to as the nominal gate size.

It is worth pointing out that the solution to this problem is not necessarily the optimal solution; however, it is very likely that the final objective function value at a solution arrived at using good heuristics will be close to the linear program solution, and hence close to the optimal solution. This supposition is borne out by the results presented in Section 7.1.
Implicit Enumeration Approach
The rationale behind our global enumeration algorithm is based on the following observation. Given the solution of the linear program, the majority of the gates remain at their smallest sizes. Only a small portion of the gates in the circuit are moved to a larger size because, for a typical circuit, although there may be a huge number of long paths, the number of gates on these long paths is, in general, relatively small.
Based on this observation, during the implicit enumeration procedure we may ignore those gates which are assigned their smallest size by the solution of the linear program, and concentrate on those gates that have been assigned larger sizes and are probably on long paths.
Definition 1: A critical gate is a gate whose size is larger than its smallest possible size.
We modify the circuit topology by adding a source node $so$ and a sink node $si$. A dummy edge is added from node $so$ to each of the input nodes and from each of the output nodes to the node $si$. Next, for each gate $i$ we define max-delay-to-sink, denoted by $mds(i)$, to be the maximum of the delays of all possible paths starting from gate $i$ to the sink node $si$ [14]. That is,
The method for finding max-delay-to-sink is a topological sort. That is, $mds(i)$ of a gate $i$ can be calculated only after all of the $mds$'s of its fan-out gates have been computed. Therefore, the computation of $mds$'s starts from sink node $si$ and proceeds backwards until we reach the source node $so$.
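The backward computation described above can be sketched as follows. This is a minimal illustration rather than the implementation used in our program: it assumes a fixed delay per gate, includes the delay of gate $i$ itself in $mds(i)$, and uses hypothetical names (`fanout`, `delay`, `max_delay_to_sink`) and a hypothetical four-gate circuit.

```python
from collections import deque


def max_delay_to_sink(fanout, delay, sink="si"):
    """Compute mds for every node by processing nodes in reverse topological order.

    fanout: dict mapping each non-sink node to the list of its fan-out nodes.
    delay:  dict mapping a gate to its (assumed fixed) delay; missing nodes,
            such as the dummy source and sink, count as zero delay.
    """
    # Number of fan-outs of each node whose mds is still unknown.
    pending = {node: len(succs) for node, succs in fanout.items()}
    # Reverse adjacency: fan-in lists.
    fanin = {}
    for node, succs in fanout.items():
        for succ in succs:
            fanin.setdefault(succ, []).append(node)

    mds = {sink: 0.0}
    ready = deque([sink])              # nodes whose mds is final
    while ready:
        node = ready.popleft()
        for pred in fanin.get(node, []):
            candidate = delay.get(pred, 0.0) + mds[node]
            mds[pred] = max(mds.get(pred, float("-inf")), candidate)
            pending[pred] -= 1
            if pending[pred] == 0:     # all fan-outs of pred are done
                ready.append(pred)
    return mds


# Hypothetical circuit: so -> {a, b} -> c -> si
fanout = {"so": ["a", "b"], "a": ["c"], "b": ["c"], "c": ["si"]}
delay = {"a": 2.0, "b": 3.0, "c": 1.0}
print(max_delay_to_sink(fanout, delay))
```

A gate is placed on the ready queue only after every one of its fan-out gates has been processed, which is the ordering requirement stated in the text.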
A breadth-first search is applied to levelize the circuit from the sink node backwards. (This is different from a traditional levelizing scheme, which starts from the source node and proceeds forwards.) The level of a gate $i$ in this levelization is called its backward circuit level, $c\_level(i)$. By definition, the backward circuit level of the sink node $si$ is 0, while the source node $so$ has the largest backward circuit level.

Starting from $si$, we form a state-space tree by implicitly enumerating critical gates. During the enumeration, noncritical gates remain at their minimum size and need not be enumerated. Each level in the state-space tree corresponds to a critical gate. We define a function $F(i)$ that indicates the critical gate corresponding to level $i$; that is, if gate $k$ is the critical gate corresponding to level $i$, then $k = F(i)$. Similarly, the level of a critical gate $k$ in the state-space tree is called the gate's tree level, $t\_level(k)$. Therefore $t\_level(F(i)) = i$. Each node at level $i$ in the state-space tree is a cell configuration, which represents a possible realization of its corresponding gate. Let $C(i,j)$ denote the $j$th node at level $i$, and $anc(i,j)$ be its ancestor node. In the state-space tree, each node has no more than two successors since there are at most two choices for the gate size. The root of the tree is, by definition, assigned a null cell configuration $(0, 0, 0)$. We begin with the critical gate that has the smallest backward circuit level and implicitly enumerate the two possible realizations of each gate $F(i)$, $w_{F(i)+}$ and $w_{F(i)-}$.
and denoted by $D'_{ij}$. The cell configuration which has the largest $D'_{ij}$ and satisfies $D'_{ij} \le T_{spec}$ is selected. By performing a trace-back from the selected leaf node to the root of the tree, the size of each critical gate is determined from the cell configurations at each traversed node.

where $R^{i}_{out}$ and $C^{i}_{out}$ are, respectively, the equivalent driving resistance of gate $i$, and the capacitive load driven by gate $i$. Therefore, $\Delta delay(G_i)$ is the difference between the original local delay of $G_i$ and the new local delay of $G_i$ after we replace it with a different gate size that has a different value of $R^{i}_{out}$ and $C^{i+1}_{out}$.
After calculating the local delay difference associated with each of the gates along path $P_l$, we select the largest one, $\Delta delay(G_n)$, which satisfies

$$\Delta delay(G_n) \le Pslack(P_l) \qquad (21)$$

and change the size of $G_n$ accordingly. If none of the local delay differences satisfy (21), we select the most negative one and replace the gate with a new realization. This process continues until the delay constraints are all satisfied. Also, notice that unlike in the mapping algorithm, we do not restrict our choices to $w_{i+}$ and $w_{i-}$ at this phase.
Optimization for Sequential Circuits
The techniques described so far are valid for the sizing problem for combinational circuits. We now consider the optimization problem for synchronous sequential circuits.
Formulation of Constraints
In a synchronous sequential circuit, a data race due to clock skew can cause the system to fail [15].
Consider a synchronous sequential digital system with flip-flops (FFs). Let $s_i$ denote the individual delay between the central clock source and flip-flop $FF_i$, and let $CP$ be the clock period. Assume there is a data path, with delay $d_{ij}$, from the output of $FF_i$ to the input of $FF_j$ for a certain input combination to the system. There are two constraints on $s_i$, $s_j$ and $d_{ij}$ that must be satisfied in order to avoid the following hazards:
Double clocking: If $s_j > s_i + d_{ij}$, then when $FF_i$ is clocked, the data races ahead through the path and destroys the data at the input to $FF_j$ before the clock arrives there.
Zero clocking: This occurs when $s_i + d_{ij} > s_j + CP$, i.e., the data reaches $FF_j$ too late.
It is, therefore, desirable to keep the maximum longest-path delay small to maximize the clock speed, while keeping the minimum shortest-path delay large enough to avoid clock hazards.
In [9], Fishburn developed a set of inequalities which indicates whether either of the above hazards is present. In his model, each $FF_i$ receives the central clock signal delayed by $s_i$ by the delay element imposed between it and the central clock. Further, in order for an FF to operate correctly when the clock edge arrives at time $t$, it is assumed that the correct input data must be present and stable during the time interval $(t - T_{setup},\ t + T_{hold})$, where $T_{setup}$ and $T_{hold}$ are the set-up time and hold time of the FF, respectively. For all of the FFs, the lower and upper bounds $MIN(i,j)$ and $MAX(i,j)$ ($1 \le i, j \le L$, $L$ being the total number of FFs in the circuit) are computed, which are the times required for a signal edge to propagate from $FF_i$ to $FF_j$.
To avoid double clocking between $FF_i$ and $FF_j$, the data edge generated at $FF_i$ by a clock edge may not arrive at $FF_j$ earlier than $T_{hold}$ after the latest arrival of the same clock edge at $FF_j$. The clock edge arrives at $FF_i$ at $s_i$, and the fastest propagation from $FF_i$ to $FF_j$ is $MIN(i,j)$. The arrival time of the clock edge at $FF_j$ is $s_j$. Thus, we have

$$s_i + MIN(i,j) \ge s_j + T_{hold}. \qquad (22)$$

Similarly, to avoid zero clocking, the data generated at $FF_i$ by the clock edge must arrive at $FF_j$ no later than $T_{setup}$ before the next clock edge arrives. The slowest propagation time from $FF_i$ to $FF_j$ is $MAX(i,j)$. The clock period is $CP$, so the next clock edge arrives at $FF_j$ at $s_j + CP$. Therefore,

$$s_i + T_{setup} + MAX(i,j) \le s_j + CP.$$
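When the path delays are treated as constants, the double-clocking and zero-clocking inequalities above are difference constraints on the skews, so the existence of a valid skew assignment for a given clock period can be tested with a Bellman-Ford negative-cycle check. The sketch below uses that substitute technique purely for illustration; it is not the linear programming formulation used in this paper, and the function name, argument layout, and numeric values are assumptions.

```python
def feasible_skews(n_ffs, paths, cp, t_setup, t_hold):
    """Return skews satisfying the double- and zero-clocking constraints, or None.

    paths: list of (i, j, min_delay, max_delay) for each FF pair with a
           combinational path from FF i to FF j (constant delays assumed).
    """
    edges = []
    for i, j, d_min, d_max in paths:
        # Double clocking: s_j - s_i <= MIN(i,j) - T_hold
        edges.append((i, j, d_min - t_hold))
        # Zero clocking:   s_i - s_j <= CP - T_setup - MAX(i,j)
        edges.append((j, i, cp - t_setup - d_max))

    # Starting from all zeros is equivalent to a virtual source with
    # zero-weight edges to every FF.
    skew = [0.0] * n_ffs
    for _ in range(n_ffs):
        changed = False
        for u, v, weight in edges:
            if skew[u] + weight < skew[v]:
                skew[v] = skew[u] + weight
                changed = True
        if not changed:
            return skew          # converged: all constraints hold
    return None                  # still relaxing: negative cycle, infeasible


# Hypothetical two-FF loop: FF0 -> FF1 (delays 3..10), FF1 -> FF0 (delays 2..4)
paths = [(0, 1, 3.0, 10.0), (1, 0, 2.0, 4.0)]
print(feasible_skews(2, paths, cp=8.5, t_setup=1.0, t_hold=0.5))
print(feasible_skews(2, paths, cp=7.0, t_setup=1.0, t_hold=0.5))
```

The returned skews may be negative; since only differences of skews enter the constraints, a common offset can be added to make all clock delays non-negative. In this toy loop, zero skew would need a clock period of at least 11, whereas a period of 8.5 is feasible with skew.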
Symbolic Propagation of Constraints
We begin by counting the number of LP constraints in (25). We ignore the constraints on the maximum and minimum sizes of each gate since these are handled separately by the simplex

The synchronous sequential circuit is first levelized. For this purpose, the inputs of FFs are considered as pseudo POs and the outputs of FFs are considered as pseudo PIs. Two string variables, $mstring(i)$ and $pstring(i)$, are used to store the long-path delay and short-path delay constraints associated with gate $i$, respectively. For each gate and each FF, an integer variable $u_i \in \{0, 1\}$ is introduced to indicate its status. $u_i$ has the value 1 whenever $mstring(i)$ and $pstring(i)$ are non-empty, i.e., when the constraints stored in $mstring(i)$ and $pstring(i)$ must be propagated; otherwise, $u_i = 0$.
The algorithm for propagating delay constraints symbolically is given in Figure 1. In the following discussion of the algorithm, we elaborate on the formation of $mstring$; the formation of $pstring$ proceeds analogously. At line 2, for each gate $j$, $u_j$ and $mstring(j)$ are initialized by setting $u_j = 0$, and $mstring(j)$ to the null string. The status variable of primary input $i$, $u_i$, is set to 1, however, since we are to process delay constraints with respect to that particular PI (line 3). At line 6, we check if $u_l = 0$ for all $l \in fanin(k)$, i.e., if all of gate $k$'s input gates have a null $mstring$. If so, no constraints need to be propagated, and no operations are needed. Hence we continue to process the next gate. Next, at line 7, we check whether exactly one of all of gate

Although the actual reduction depends on the structure of the circuit, experimental results show that the symbolic constraint propagation algorithm can reduce the number of constraints to less than 7% of the original number on average for the tested circuits.
Inserting Delay Buffers to Satisfy Short Path Constraints
The solution of the LP would, in general, provide a gate size, $w_k$, that does not belong to the permissible set, $S_k = \{w_{k,1}, \ldots, w_{k,q_k}\}$. If so, we consider the two permissible gate sizes that are closest to $w_k$; we denote the nearest larger (smaller) size by $w_{k+}$ ($w_{k-}$). As in Section 3, we formulate the following smaller problem:
For all $k = 1, \ldots, N$: Select $w_k = w_{k+}$ or $w_{k-}$, such that for all FFs $1 \le i, j \le L$,

$$s_i + \mathit{Maxdelay}(i,j) + T_{setup} \le s_j + CP$$

$$s_i + \mathit{Mindelay}(i,j) \ge s_j + T_{hold}$$
The mapping algorithm described in Section 3 can be used to obtain a solution for this problem.
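The first step of this mapping, finding the two permissible sizes that bracket an LP solution value, can be sketched as below. The function name and the example library sizes are hypothetical; the choice between the two candidates is made by the mapping algorithm of Section 3 and is not shown here.

```python
from bisect import bisect_left


def neighbor_sizes(w_lp, library_sizes):
    """Return (w_minus, w_plus): the nearest permissible sizes at or below
    and at or above the LP value w_lp.

    If w_lp lies outside the library range, both values are clamped to the
    nearest end of the range (an assumption made for this sketch).
    """
    sizes = sorted(library_sizes)
    pos = bisect_left(sizes, w_lp)
    if pos < len(sizes) and sizes[pos] == w_lp:
        return w_lp, w_lp                    # already a permissible size
    w_minus = sizes[max(pos - 1, 0)]
    w_plus = sizes[min(pos, len(sizes) - 1)]
    return w_minus, w_plus


# Hypothetical library with four sizes per cell.
print(neighbor_sizes(2.6, [1, 2, 4, 8]))   # (2, 4)
print(neighbor_sizes(2, [1, 2, 4, 8]))     # (2, 2)
```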
After the mapping phase, if some of the delay constraints cannot be satisfied, we have to fine-tune some gate sizes in the circuit. In Section 4, we have discussed the approach to resolve violation of long path delay constraints. The same strategy can be applied for synchronous sequential circuit optimization, except the definition of path slack must be modified. Note that if gate $i$ is at a PO, it could still fan out to other gates in the circuit; this is reflected in the definition of the gate slack. Physically, a gate slack corresponds to the amount by which the delay of gate $i$ can be increased before its effect will be propagated to any POs or FFs, in terms of long path delay. Therefore, it also tells us the maximum delay that a delay buffer can have if we are to insert a delay buffer at the output of gate $i$.

Figure 3: The buffer insertion algorithm.
If output gate $G_{n_1}$ violates the hold time constraint, its shortest path, $P_s(n_1)$, to some PI is first identified. If $p_{n_1}$ is the worst-case shortest path signal arrival time of gate $n_1$, and $req_s(n_1)$ is the required shortest path delay, then the delay of $P_s(n_1)$ must be increased by at least $req_s(n_1) - p_{n_1}$.
At the beginning of this phase, we first back-propagate gate slacks from POs and all FFs. The gate slack of each gate is determined recursively using (28).
The algorithm for inserting buffers is shown in Figure 3. In line 4 of the algorithm, beginning from the smallest buffer in the library, we try to insert a buffer at the output of gate $G_{n_i}$. The delay of the buffer is denoted by $delay(bf)$. Since the output capacitance of $G_{n_i}$ is changed during this process, we have to recalculate its delay, which is denoted by $delay'(G_{n_i})$.
Partitioning Large Synchronous Circuits
As indicated above, the number of constraints in our formulation of the LP is, in the worst case, proportional to the product of the number of gates and the number of FFs in the circuit. Ideally, for a given synchronous sequential circuit, all variables and constraints should be considered together to obtain an optimal solution. However, for large synchronous sequential circuits, the size of the LP could be prohibitively large even with our symbolic constraint propagation algorithm. Therefore, it is desirable to partition large synchronous sequential circuits into smaller, more tractable subcircuits, so that we can apply the algorithm described in Section 5 to each subcircuit. While this would entail some loss of optimality, an efficient partitioning scheme would minimize that loss; moreover, the reduction in execution time would be substantial.
It is well known that multiple-way network partitioning problems are NP-hard. Therefore, typical approaches to solving such problems find heuristics that will yield approximate solutions in polynomial time [17-24]. Traditional partitioning problems usually have explicit objective functions; for example, in physical layout it is desirable to have minimal interface signals resulting from partitioning the circuit, and hence the objective function to be minimized there is the number of nets connecting more than two blocks. Our synchronous sequential circuit partitioning problem, however, is made harder by the absence of a well-defined objective function; since our ultimate goal is to minimize the total area of the circuit, there is no direct physical measure that could serve as an objective function for partitioning. In the remainder of this section, we develop a heuristic measure that will be shown to be an effective objective function for our partitioning problem.
To help us describe our partitioning algorithm, we introduce the following terminology. For a synchronous sequential circuit, such as the one shown in Figure 4:
An internal latch is a latch whose fanin and fanout gates belong to the same combinational block.
A sequential block consists of a combinational subcircuit and its associated internal latches.
Boundary latches are latches that act as either a pseudo PI or a pseudo PO, but not both, to a combinational block, i.e., latches whose fanin and fanout gates belong to different combinational blocks.
A partition of a synchronous sequential circuit N is a partition of the sequential blocks of N

where N is the number of groups, MaxConstraints is the maximum number of constraints that one wishes to feed to the LP, and a relaxation factor of at least 1 is introduced so that the partitioning procedure becomes more flexible, since the cost of a group is allowed to exceed MaxConstraints temporarily. Now that the partitioning problem has been explicitly defined, we develop a multiple-way synchronous sequential circuit partitioning algorithm based on the algorithm proposed by Sanchis [20].
For each group $G_k$, and each boundary latch $L$, define the connection number as:

such that for each group $k$, $C(G_k) \le MaxConstraints$. This is an integer knapsack problem, and many heuristic algorithms can be used to obtain an initial partition (see, for example, [25], Chapter 2). In some cases, it may be impossible to put all blocks into N groups without violating the restriction on $C(G_k)$ above; if so, the number of groups may be larger than that given in (34).
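One simple heuristic of the kind mentioned above is first-fit decreasing, sketched below. This is only an example of a possible initial-partition heuristic, not necessarily the one used in our implementation. It assumes that the cost of a group is the sum of the constraint counts of its blocks; the function name and the block costs are hypothetical.

```python
def initial_partition(block_costs, max_constraints):
    """First-fit decreasing: place each block in the first group that stays
    within max_constraints, opening a new group when none fits.

    block_costs: dict mapping a sequential block to its number of constraints
                 (group cost is assumed additive for this sketch).
    Returns a list of groups, each a list of block names.
    """
    groups = []                      # each entry: [total_cost, [blocks]]
    ordered = sorted(block_costs.items(), key=lambda item: -item[1])
    for block, cost in ordered:
        for group in groups:
            if group[0] + cost <= max_constraints:
                group[0] += cost
                group[1].append(block)
                break
        else:
            # No existing group can hold this block; a block that alone
            # exceeds the limit still gets a group of its own.
            groups.append([cost, [block]])
    return [blocks for _, blocks in groups]


# Hypothetical block costs and limit.
costs = {"B1": 60, "B2": 50, "B3": 40, "B4": 30, "B5": 20}
print(initial_partition(costs, 100))   # [['B1', 'B3'], ['B2', 'B4', 'B5']]
```

Because a new group is opened whenever no existing group has room, the number of groups produced can exceed the target number, which matches the remark above.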
Given the initial partition, the algorithm improves it by iteratively moving one block from one group to another in a series of passes. A block is labeled free if it has not been moved during that pass. Each pass in turn consists of a series of iterations, during each of which the free block with the largest gain is moved. During each move, we ensure that the number of constraints in a group does not violate the limit given by (31). The gain number is updated constantly as blocks are moved from one group to another. At the end of each pass, the partitions generated during that pass are examined and the one with the maximum objective value, as given by (31), is chosen as the starting partition for the next pass. Passes are performed until no improvement of the objective value can be obtained.
After the partitioning, we apply the optimization algorithm described in Section 5 to each group.
Experimental Results
The algorithms above were implemented in a program, GALANT (GAte sizing using Linear programming ANd heurisTics), on a Sun Sparc10 station. The test circuits include many of the ISCAS85 combinational benchmark circuits [26] and ISCAS89 synchronous sequential circuits [27].
Each cell in the standard-cell library has four different sizes of realization with different driving capabilities. The number of regions of piecewise linear approximation of delay, q, is set to 4 empirically. Ideally, the larger q is, the more accurately we can approximate the delay function. However, increasing the value of q will directly increase the number of constraints in the LP formulation, which will result in larger run times. Therefore it is desirable to keep q small while maintaining acceptable approximation errors. Section 7.1 provides experimental results for the combinational circuit optimization problem. The experimental results for synchronous sequential circuits with clock skew optimization are given in Section 7.2.
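The trade-off in choosing q can be illustrated numerically. The sketch below assumes that the nonlinear part of the delay is the $1/w$ term arising from the resistance $R_u / w$ in (4), and approximates it by q chords on uniformly spaced breakpoints. The breakpoint placement, the size range, and the function name are assumptions for illustration and need not match the approximation used in our formulation.

```python
def chord_approx_error(w_min, w_max, q, samples=1000):
    """Maximum absolute error of a q-segment chord approximation of 1/w
    on [w_min, w_max], measured on a uniform sample grid."""
    knots = [w_min + (w_max - w_min) * k / q for k in range(q + 1)]
    worst = 0.0
    for n in range(samples + 1):
        w = w_min + (w_max - w_min) * n / samples
        # Index of the segment containing w (clamped at the right end).
        k = min(int((w - w_min) / (w_max - w_min) * q), q - 1)
        w0, w1 = knots[k], knots[k + 1]
        # Linear interpolation between the segment end points.
        approx = 1 / w0 + (1 / w1 - 1 / w0) * (w - w0) / (w1 - w0)
        worst = max(worst, abs(approx - 1 / w))
    return worst


# Hypothetical size range [1, 8]; the error shrinks as q grows.
for q in (2, 4, 8):
    print(q, chord_approx_error(1.0, 8.0, q))
```

Each additional region adds linear constraints per gate to the LP, so the reduction in approximation error has to be weighed against the growth of the problem size.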
Experimental Results for Combinational Circuits
To demonstrate the efficacy of the approach, a simulated annealing algorithm and Lin's algorithm [6] were implemented for comparison. The parameters used in Lin's algorithm have been tuned to give the best overall results. The simulated annealing algorithm that we have implemented is similar to that described in [28]. However, unlike in [28], all gate sizes were allowed to change during the simulated annealing procedure; while the run-times for this procedure were extremely high, the solution obtained can safely be said to be close to optimal.
The results of our approach, in comparison with Lin's algorithm and simulated annealing, are shown in Table 1. The test circuits include most of the ISCAS85 benchmarks, and vary in size from

For somewhat tight specifications, however, its solution becomes pessimistic. For even tighter delay constraints, it cannot obtain a solution at all. As mentioned previously, Lin's algorithm essentially is an adaptation of the TILOS algorithm [1] for continuous transistor sizing, with a few enhancements. While the TILOS algorithm is known to work reasonably well for the continuous sizing case, the primary reason for its success is that the change in the circuit in each iteration is very small. However, in the discrete sizing case, any change must necessarily be a large jump, and a TILOS-like algorithm is likely to give very suboptimal results.
A comparison of GALANT, Lin's algorithm, and simulated annealing on the circuit c432, for various timing specifications, is shown in Figure 5. In all cases, the solution obtained by GALANT is very close to the solution obtained by simulated annealing. In comparison with the results of Lin's algorithm, we find that GALANT provides results of substantially better quality, with reasonable run-times. Table 2 shows the areas given by linear programming, GALANT, and simulated annealing for c432.
Experimental Results for Synchronous Sequential Circuits
For all the experiments on sequential circuits, the clock skews are restricted to be less than half of the clock cycle to avoid excessively large differences between clock signal arrival times at different latches. In Table 3, the experimental results of fifteen ISCAS89 circuits are listed. For information on the number of PIs, POs, FFs, and logic gates in the circuits, see [27]. For each circuit, the number of longest-path delay constraints without using the symbolic constraint propagation algorithm and the number of constraints pruned by the algorithm are given. It is clear that our pruning algorithm is very efficient. The number of delay constraints is reduced by more than 93% on average. For a given desired clock period, the optimized results both with and without clock skew optimization are shown. Depending on the structure of the circuits, the improvement in total area of the circuit ranges from 1.2% to almost 20%. As for the execution time, the runtime ranges from about the same for some circuits, to less than double or triple for most circuits.

Table 4 provides some more in-depth experiments on two circuits, s838 and s1423. In this experiment, we try to minimize the area using different specified clock periods. As one can see, for s1423, the minimum clock period without clock skew optimization is about 32.5. On the other hand, using clock skew optimization, the minimum period can be as small as 22, which gives an almost 33% improvement in terms of clock speed. For s838, using clock skew optimization also gives a 30% improvement. Hence, using clock skew optimization can not only reduce the circuit area, but also allow a faster clock speed.

Table 5 gives the experimental results for the partitioning procedure. Since most of the ISCAS89 circuits consist of only one combinational block, we generated some synchronous sequential random logic circuits.
The numbers of gates and FFs in those circuits are shown in the table.
1. First, we minimize the area using clock skew optimization, but without partitioning.
2. Secondly, we minimize the circuit area using both clock skew optimization and partitioning.
3. For comparison, we minimize the circuit with neither clock skew optimization nor partitioning.
From the table, it can be seen that the first approach is able to obtain the best result, as expected. Since it considers all variables at the same time, it provides the best solution. However, the runtime is large. Compared to the first approach, the second approach runs much faster, at a very slight area penalty. Not surprisingly, the third approach gets the worst solution. We also note that the introduction of clock skew provides a significantly faster clock speed for circuit m1337. Although it has not been shown here, the same result also holds for m1783. For m1783, we also specify several different values of MaxConstraints. The result shows that as the specified MaxConstraints increases, the number of groups after partitioning decreases. As the number of groups decreases, the optimized solution using the partitioning procedure improves, while the runtime only increases slightly.
When N = 6, the solution is comparable to that without using partitioning, and the runtime is still far less than that without using partitioning. (Here N denotes the number of groups after partitioning.)
Conclusion
In this paper, an efficient algorithm is presented to minimize the area taken by cells in standard-cell designed combinational circuits under timing constraints. We present a comparison of the results of our algorithm with the solutions obtained by our implementation of Lin's algorithm [6] and by simulated annealing. In [6], it was shown that Lin's algorithm is able to obtain better results than the technology mapping of MIS2 [29]. Although Lin's algorithm is fast, its solution becomes excessively pessimistic for tight delay constraints. For very tight timing constraints, it fails to obtain a solution at all. Experimental results show that our approach can obtain near-optimal solutions compared to simulated annealing in a reasonable amount of time, even for very tight delay constraints. By adding additional linear programming constraints to account for short path delay [30], and slightly modifying the mapping and adjusting algorithm, the same approach can be used to tackle the double-sided delay constraints problem.
A unified approach to minimizing synchronous sequential circuit area and optimizing clock skews has also been presented. The skews at various latches in a circuit may be set using the algorithm in [31]. Traditionally, the circuit area of a synchronous sequential circuit is minimized one combinational subcircuit at a time. Our experiments have shown that this may lead to very suboptimal solutions in some cases.
We formulate the discrete gate sizing optimization as a linear program, which enables us to integrate the equations with clock skew optimization constraints, taking a more global view of the problem. Experimental results show that this approach can not only reduce total circuit area, but also give much faster operational clock speed. For large synchronous sequential circuits, we also present a partitioning procedure. Our experiments show that our partitioning procedure is very effective in making our optimization algorithm run at a much faster speed, with no significant degradation in the quality of the solution.
Finally, the clock skew scheme may appear similar to the maximum-rate pipelining technique used in pipelined computer systems [32]. However, the clock in a maximum-rate pipeline cannot be single-stepped or even slowed down significantly. This makes maximum-rate designs extremely hard to debug. In the clock skew scheme, by contrast, single-stepping is always possible [9]. Therefore circuits implemented using the clock skew technique can be debugged without difficulty.

Table 2: Experimental results of areas given by linear programming, GALANT and simulated annealing for circuit c432.
Table 3: Performance comparison with and without clock skew optimization for ISCAS89 benchmark circuits.
Table 4: Improving possible clocking speeds using clock skew optimization.
Table 5: Performance comparison of the partitioning procedure.
Fig. 1: The symbolic constraints propagation algorithm.
