Abstract-The timing-convergence problem arises because estimations made during logic synthesis may not be met during physical design. In this paper, an efficient rewiring engine is proposed to explore maximal freedom after placement. The most important feature of this approach is that the existing placement solution is left intact throughout the optimization. A linear-time algorithm is proposed to detect functional symmetries in the Boolean network which are then used as the basis for rewiring. Integration with an existing gate-sizing algorithm further proves the effectiveness of our technique. Three applications are demonstrated: delay, power, and reliability optimization.
I. INTRODUCTION

IN THIS PAPER, we present a technique focusing on the identification of symmetries when the target Boolean function is represented as a mapped Boolean network. In general, when two wires are functionally symmetric, they can be swapped without changing the overall circuit functionality. We have established a relationship between implication supergates [24] and functional symmetry. Based on our analysis of supergates, we propose a linear-time algorithm for symmetry identification in a multilevel netlist. We have developed efficient postplacement performance-improvement algorithms which apply symmetry-based rewiring and gate sizing.
Power consumption and speed are two primary cost functions in today's integrated circuit design. As mobile computation devices prevail in the market, the ability to design fast, low-power devices is of paramount importance. However, these two objectives often conflict: a faster circuit consumes more power, but a low-power circuit runs slower. Hence, designers often need to trade off power for speed and vice versa to meet the desired specifications. To get the best performance, power and speed are considered at various stages of the design cycle, including the architecture, register-transfer, gate, and layout levels.
Circuit reliability is another emerging design consideration. Design trends, such as device miniaturization, system-on-a-chip integration, and higher operating frequencies, increase concerns about circuit reliability. Hot-carrier effect (HCE) is one of the major failure mechanisms affecting long-term reliability. As the device dimensions shrink to the deep submicron ranges, the electric field in a transistor's channel increases significantly. Electrons and holes traveling in the channel may gain high enough kinetic energy to be injected into the gate oxide and cause permanent changes in the oxide-interface charge distribution. In an NMOS transistor, HCE leads to transconductance degradation, shift in the threshold voltage, and decrease in the drain-current driving capability. The performance degradation of particular devices leads to degradation in the overall circuit performance. The transistor degradation behavior is a function of time, the number of transitions, its fanins' driving capability, and geometric dimensions. The effects accumulate while the device is in operation. As a result, circuits age.
Our proposed rewiring technique will be used to target delay, power, and reliability optimization specifically at the postplacement level, when the gate-level netlist is already placed on a two-dimensional plane. The rationale behind this methodology is that delay, power, and reliability cannot be determined without physical information. The optimization effort could be misled by inaccurate estimation of the objective function. However, since the placement is already done, optimization techniques used at this level must not perturb the existing placement solution too much, in order to guarantee timing closure. Buffer insertion and gate sizing are traditionally the only two techniques that are suitable for this purpose. However, these two techniques are limited to the existing netlist and do not enable us to explore a larger solution space by restructuring the logic. In recent years, rewiring techniques, such as redundancy addition and removal [11], have been successfully applied at postlayout stages to restructure the logic. This technique has the property that only wires are reconnected without disturbing the existing placement solution. This important property allows logic changes to be guided by accurate delay information. Our proposed functional symmetry-based rewiring technique explores another degree of freedom compared with the existing techniques.
II. PRELIMINARIES
A Boolean network [1] is a directed acyclic graph (DAG) $N = (V, E)$, where $V$ is the set of vertices and $E$ is the set of edges connecting vertices. Let $v$ be a vertex in $V$. The (immediate) fanout of $v$ is the set of nodes $u$ such that there is an edge from $v$ to $u$. Similarly, the (immediate) fanin of $v$ is the set of nodes $u$ such that there is an edge from $u$ to $v$. A node is called an internal node if it is neither a primary input (PI) nor a primary output (PO). The fanout cone of $v$, or transitive fanout of $v$, is the set of nodes $u$ such that there exists a directed path from $v$ to $u$. Similarly, the fanin cone of $v$, or transitive fanin of $v$, is the set of nodes $u$ such that there exists a directed path from $u$ to $v$. The terms "vertex" and "node" are used interchangeably. Each vertex $v$ has a function $f_v$ associated with it. Function $f_v$ maps the Boolean space spanned by its fanins to the space spanned by its fanouts. In practice, we are interested only in the case when $f_v$ is a single-output function. All fanouts of $v$ are fed by the same $f_v$. Even though $f_v$ is a single-output function, the number of fanouts of $v$ can be more than one.
0278-0070/04$20.00 © 2004 IEEE
Let $L$ be a library of logic gates. A Boolean network is mapped when the function associated with each vertex is implemented by a gate from $L$. Otherwise, it is unmapped. Let $g$ be a gate in a mapped Boolean network $N$. An in-pin of $g$ is a connector of $g$ to which outputs of other gates can connect. An out-pin of $g$ is the connector that can drive other gates' in-pins. The logic type of $g$ is denoted type($g$). We do not distinguish between the name of a gate and its out-pin.
Let $T$ be a fanout-free network rooted at a gate $r$. By fanout-free, we mean that each node inside $T$ has only a single fanout. A path is an alternating sequence of pins and gates such that for any two consecutive gates the former is driving the latter. For example, a path can start from an in-pin of one gate and end at an in-pin of another gate. A path is fanout-free if all gates along the path are fanout-free. An input to a gate on the path is a side input if the gate driving it is not on the path.
Definition 1: An input-controlling value of an in-pin $p$ of a gate $g$ is the logic value which, when set at $p$, uniquely determines the output of $g$ regardless of the logic values on other inputs. An output-controlled value of a gate $g$ is the output value of $g$ when one of its input pins is set to its input-controlling value.
For example, when type($g$) = NAND, the input-controlling value of an in-pin of $g$ is 0, and the output-controlled value of $g$ is 1. For buffers and inverters, both 0 and 1 are input-controlling values, since they uniquely determine the output. The input-controlling value of the in-pins of an exclusive-or (XOR) gate is undefined since no logic value at any single input can uniquely determine the output. In this case, the output-controlled value is also undefined.
Definition 2: An output noncontrolled value of a gate $g$ is the logic value which, when set at the output of $g$, uniquely implies the logic values at the inputs of $g$. An input noncontrolling value of an in-pin $p$ of $g$ is the logic value inferred at $p$ when the output of $g$ is set to its output noncontrolled value.
For example, when type($g$) = NAND, the output noncontrolled value of $g$ is 0, since a 0 at the output of $g$ uniquely implies that all inputs to $g$ have to be set to 1, the input noncontrolling value. For buffers and inverters, both 0 and 1 are output noncontrolled values since they also imply the only input's logic value. The output noncontrolled value and input noncontrolling values are undefined for an exclusive-or gate.
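The examples for Definitions 1 and 2 can be checked by brute force over a gate's truth table. The following is only an illustrative sketch: the `GATES` table and both helper names are assumptions introduced here, and the helpers handle symmetric multi-input gates only, so buffers and inverters are left out.

```python
from itertools import product

# Hypothetical gate models for illustration; each maps a tuple of input bits to 0/1.
GATES = {
    "AND":  lambda bits: int(all(bits)),
    "NAND": lambda bits: int(not all(bits)),
    "OR":   lambda bits: int(any(bits)),
    "NOR":  lambda bits: int(not any(bits)),
    "XOR":  lambda bits: sum(bits) % 2,
}

def input_controlling(gate, n=2):
    """Return (input-controlling value, output-controlled value), or None if undefined."""
    fn = GATES[gate]
    for v in (0, 1):
        # Fix one in-pin to v and enumerate all values of the other in-pins.
        outs = {fn((v,) + rest) for rest in product((0, 1), repeat=n - 1)}
        if len(outs) == 1:          # the output no longer depends on the other pins
            return v, outs.pop()
    return None

def output_noncontrolled(gate, n=2):
    """Return (output noncontrolled value, input noncontrolling value), or None."""
    fn = GATES[gate]
    for out in (0, 1):
        preimage = [bits for bits in product((0, 1), repeat=n) if fn(bits) == out]
        # The output value must imply one unique input assignment.
        if len(preimage) == 1:
            return out, preimage[0][0]
    return None
```

For a NAND gate this reproduces the values quoted above: `input_controlling("NAND")` gives `(0, 1)` and `output_noncontrolled("NAND")` gives `(0, 1)`, while both helpers return `None` for XOR.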
Logic implication is a process of inferring consistent logic values based on known logic values. Given a logic value assigned at the out-pin of a gate $g$, implication can proceed forward or backward until no more logic values can be inferred. If the assigned value equals the output noncontrolled value of $g$, all in-pins of $g$ can be inferred to carry the input noncontrolling value. This process is called direct backward implication. For example, let type($g$) = AND and let the out-pin of $g$ be assigned 1. All in-pins of $g$ are inferred with logic value 1. Direct backward implication stops at a gate $g$ when the value assigned at the out-pin of $g$ is not equal to its output noncontrolled value, and hence no logic value at the in-pins of $g$ can be further inferred.
is the value set at a pin during direct backward implication.
Let $S$ be an ordered list of nodes from a Boolean network. If, for any $u$ preceding $v$ in $S$, $u$ is not in the transitive fanout cone of $v$, then $S$ is in topological order. $S$ is in reverse topological order if, for any $u$ preceding $v$ in $S$, $u$ is not in the transitive fanin cone of $v$.
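We will use the following two types of symmetries.
Definition 3: $x_i$ and $x_j$ are nonequivalence symmetric (NES) in $f$ if and only if $f(\ldots, x_i, \ldots, x_j, \ldots) = f(\ldots, x_j, \ldots, x_i, \ldots)$. Plugging all four possible value combinations of $x_i$ and $x_j$ into this equation gives the following four relationships:
$$f_{\bar{x}_i \bar{x}_j} = f_{\bar{x}_i \bar{x}_j}, \quad f_{\bar{x}_i x_j} = f_{x_i \bar{x}_j}, \quad f_{x_i \bar{x}_j} = f_{\bar{x}_i x_j}, \quad f_{x_i x_j} = f_{x_i x_j}.$$
The first and fourth relationships are clearly tautological. Hence, for $x_i$ and $x_j$ to be NES, the cofactors $f_{\bar{x}_i x_j}$ and $f_{x_i \bar{x}_j}$ have to be equivalent. The name "nonequivalence symmetric" was derived from the fact that in the second and third relationships, $x_i$ and $x_j$ are of opposite logic values.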
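Definition 4: $x_i$ and $x_j$ are equivalence symmetric (ES) [8] in $f$ if and only if $f(\ldots, x_i, \ldots, x_j, \ldots) = f(\ldots, \bar{x}_j, \ldots, \bar{x}_i, \ldots)$. That is, $f$ is unchanged when $x_i$ and $x_j$ are exchanged and both complemented. Again, we could plug in all four possible value combinations of $x_i$ and $x_j$ in the above equation and derive the following four relationships:
$$f_{\bar{x}_i \bar{x}_j} = f_{x_i x_j}, \quad f_{\bar{x}_i x_j} = f_{\bar{x}_i x_j}, \quad f_{x_i \bar{x}_j} = f_{x_i \bar{x}_j}, \quad f_{x_i x_j} = f_{\bar{x}_i \bar{x}_j}.$$
This time, the second and third relationships are tautological. Hence, for $x_i$ and $x_j$ to be ES, the cofactors $f_{\bar{x}_i \bar{x}_j}$ and $f_{x_i x_j}$ have to be equivalent. The name "equivalence symmetric" was derived from the fact that in the first and fourth relationships, $x_i$ and $x_j$ are of the same logic values.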
Definition 5: $x_i$ and $x_j$ are symmetric in $f$ if they are either NES or ES in $f$. That is, when the distinction between NES and ES is not of importance in the context, we use the term "symmetric" to mean the relationship is either NES or ES.
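For small functions, the two symmetry types can be checked directly from the cofactor conditions: NES requires the cofactors with the two variables at opposite values (0,1 and 1,0) to be equivalent, and ES requires the cofactors with the two variables at equal values (0,0 and 1,1) to be equivalent. The sketch below is an exhaustive truth-table check written only for illustration; the function names are assumptions, and it is not the linear-time method developed in this paper.

```python
from itertools import product

def cofactor(f, n, i, vi, j, vj):
    """Truth table of the n-input function f with x_i = vi and x_j = vj."""
    table = []
    for rest in product((0, 1), repeat=n - 2):
        it = iter(rest)
        # Place vi and vj at positions i and j; fill the other positions from rest.
        bits = [vi if k == i else (vj if k == j else next(it)) for k in range(n)]
        table.append(f(*bits))
    return tuple(table)

def is_nes(f, n, i, j):
    """Nonequivalence symmetric: cofactors at (0,1) and (1,0) agree."""
    return cofactor(f, n, i, 0, j, 1) == cofactor(f, n, i, 1, j, 0)

def is_es(f, n, i, j):
    """Equivalence symmetric: cofactors at (0,0) and (1,1) agree."""
    return cofactor(f, n, i, 0, j, 0) == cofactor(f, n, i, 1, j, 1)

def is_symmetric(f, n, i, j):
    return is_nes(f, n, i, j) or is_es(f, n, i, j)
```

As a usage example, for $f = ab + c$ the pair $(a, b)$ is NES but not ES, for $f = a\bar{b}$ the pair is ES but not NES, and for $f = a \oplus b$ the pair is both.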
Detecting symmetry has long been an active area of research in switching theory. In [16] , necessary conditions for the existence of symmetry in a completely specified Boolean function are identified using the structural properties of ROBDD [1] . Chang et al. [4] present an extension to handle incompletely specified functions. In [19] , symmetry detection is transformed into a test generation problem and solved using automatic test pattern generation (ATPG) techniques.
We briefly review some terminology used in test generation. After a chip has been designed and fabricated, it must be tested to determine whether it is working correctly. This is done by applying input vectors and then capturing and analyzing the output response. For a sequential circuit, up to $2^{n+m}$ vectors may be tested, where $n$ is the number of primary inputs and $m$ is the number of flip-flops. Applying all of the possible input vectors may take too much time. The single stuck-at-fault model assumes that the physical defects manifest themselves as wires that are permanently connected to either $V_{DD}$ or GND, and that only one such stuck-at wire exists in a given circuit.
For the single stuck-at-fault model, a wire $w$ from a node $u$ to a node $v$ could be stuck at either 1 or 0. Let $f$ be a multiple-input multiple-output Boolean function implemented by a combinational circuit C. The functions $f_{\text{faulty}}$ and $f_{\text{good}}$ are implemented by the faulty and good circuits, respectively. We use the D-notation [21] to represent the fault effect. D (1/0) means that in the good circuit, the value of a particular wire is 1 whereas in the faulty circuit the value of the same wire is 0. $\bar{D}$ (0/1) denotes the opposite case. The fault on $w$ is testable if there exists a primary input vector $X$ such that $f_{\text{good}}(X) \neq f_{\text{faulty}}(X)$. That is, the difference between the good and faulty circuits can be observed at primary outputs when the primary input vector $X$ is applied. When no vector exists that can distinguish the faulty circuit from the good one, the fault on $w$ is redundant and can be removed by assigning the constant stuck-at value on $w$. The process of finding such a vector through algorithmic means is called automatic test pattern generation (ATPG). The processes of test generation and redundancy identification are known to be NP-hard.
In test generation, each node in the network could assume one of five different logic values, which are 0, 1, D, $\bar{D}$, or unassigned. Logic operations work in a bit-wise fashion. For example, the logic-OR operation (+) between 1 and D can be determined from the individual operations in the good and faulty circuits, that is, $1 + 1$ in the good circuit, and $1 + 0$ in the faulty circuit. The result is 1 in the good circuit, and 1 in the faulty circuit. In short, $1 + D = 1$. Other operations can be similarly deduced.
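The bit-wise evaluation described above can be sketched by encoding each assigned value as a (good, faulty) pair and applying the ordinary Boolean operation to each component. The encoding and the function names below are illustrative assumptions, and the unassigned value is left out for brevity.

```python
# Each assigned value is a (good-circuit, faulty-circuit) pair.
ZERO, ONE = (0, 0), (1, 1)
D, DBAR = (1, 0), (0, 1)   # D = 1/0, D-bar = 0/1

def v_or(a, b):
    """Logic OR applied separately to the good and faulty components."""
    return (a[0] | b[0], a[1] | b[1])

def v_and(a, b):
    """Logic AND applied separately to the good and faulty components."""
    return (a[0] & b[0], a[1] & b[1])

def v_xor(a, b):
    """Logic XOR applied separately to the good and faulty components."""
    return (a[0] ^ b[0], a[1] ^ b[1])

def v_not(a):
    """Inversion applied separately to the good and faulty components."""
    return (1 - a[0], 1 - a[1])
```

With this encoding, `v_or(ONE, D)` evaluates to `ONE`, matching the example in the text, while `v_and(ONE, D)` keeps the fault effect `D` and `v_not(D)` gives `DBAR`.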
We now reiterate the main results of [19] in the following lemma. function is indistinguishable from the function , which is exactly Definition 3. Lemma 1 establishes the link between the theory of test generation and the traditional definition of functional symmetry.
All previous attempts at symmetry detection have focused on finding symmetries in the primary inputs of a given function. Let $F$ be a multiple-input, multiple-output Boolean function defined on $X$ and represented by a mapped Boolean network $N$. Also, let $T$ be a single-output subnetwork of $N$ and let $f$ be the corresponding Boolean function defined on $Y$, where $Y$ is a set of internal signals of $N$. Instead of finding symmetries for $F$ with respect to $X$, we focus on the identification of symmetries for $f$ with respect to $Y$. The number of detected symmetries increases dramatically since $f$ is only a subfunction of $F$. This analysis forms the basis of our rewiring technique.
Symmetries detected inside a Boolean network immediately provide ways to restructure the network for better performance. In Fig. 1(a) , if we know and are symmetric, they can be swapped without changing the overall circuit functionality. Depending on the optimization goal, one of the two circuits might be better than the other. We now present our approach to detect functional symmetries inside a Boolean network.
III. DETECTION OF SYMMETRIES IN A NETWORK
In this section, we use the theory of ATPG as a tool for the proofs. Our algorithm does not use ATPG.
A. Symmetry Detection
Let be a fanout-free network rooted at . Let and be two in-pins in as illustrated in Fig. 2 . Since is fanout-free, there exists a unique path from to , where p is an in-pin of . Similarly, there exists a unique path from to , where is an in-pin of . We use the notation and for these two paths, respectively. All of the following lemmas and theorems in this section are considered under the assumption that the underlying structure is fanout-free. We also assume that and do not properly contain each other. Since an XOR gate has no controlling value, it is clear that these two definitions are mutually exclusive. That is, if is AND-OR-reachable from , it cannot be XOR-reachable from . Also, if is XOR-reachable from , it cannot be AND-OR-reachable from . Fig. 3 illustrates the definition of AND-OR and XOR reachability. In Fig. 3(a) , when the output of is set to 1, it implies to 1, which further infers a 0 through the inverter. The backward implication ends at with a 1 implied. By definition, we say is AND-OR reachable from . In Fig. 3(b at results in either 0 or 1 being implied at , depending on the number of inversions along the path. This is in conflict with being assigned D initially. By contradiction, value cannot be set to . . Take Fig. 3(a) as an example. is AND-OR reachable from . Let a logic value D be set at . At , if we set a logic value 1, which is the input noncontrolling value of the in-pin , it immediately implies a 1 at . This would contradict the fact Proof: The proof of the if part is trivial and, hence, omitted.
We prove that if the two pins are symmetric in $T$, then both are AND-OR-reachable or both are XOR-reachable from the root of $T$. From Lemma 5, we know that if the two pins are neither both AND-OR-reachable nor both XOR-reachable, they cannot be symmetric. By the law of contraposition, we conclude that the two pins are both AND-OR-reachable or both XOR-reachable if they are symmetric.
The importance of Theorem 1 is twofold. First, it expresses functional symmetry in terms of AND-OR and XOR reachability in a fanout-free network. Second, it provides the theoretical foundation for an efficient linear-time algorithm for symmetry detection. Generally speaking, the condition of AND-OR and XOR reachability leads to the identification of NES and ES among the input pins. Knowing NES and ES directly allows us to swap pins with or without adding inverters. Details of the algorithm will be presented in the next section.
B. Generalized Implication Supergate (GISG)
To improve the efficiency of the test generation process, implication supergate extraction has been proposed by Tsai et al. in [24]. The extraction starts by assigning the output noncontrolled value to each of the primary outputs, and direct backward implications are performed as far as possible. Gates at which implications stop are called implication supergate roots and are assigned their corresponding output noncontrolled value to start another round of direct backward implication. The gates that are reached by the same round of direct backward implication form an implication supergate rooted at the output of the gate from which the process has begun. The concept of implication supergate is extended in the following definition:
Definition 7: Let $T$ be a fanout-free subnetwork of $N$ rooted at gate $r$. A GISG of $r$ is the set of gates in $T$ that are either AND-OR-reachable or XOR-reachable from $r$. A gate $g$ is covered by the GISG rooted at $r$ if $g$ belongs to that GISG. An in-pin is covered by a GISG if the gate to which it is attached is covered by the GISG.
The boundary in-pins of a GISG are the in-pins covered by the GISG whose fanins are neither AND-OR-reachable nor XOR-reachable from the supergate's root.
Here, the original definition of an implication supergate [24] has been extended to include XOR gates. Even though the property of AND-OR and XOR reachability can cross multiple fanout points, we restrict GISG to fanout-free regions for the purpose of keeping symmetry detection easy.
To extract the maximal GISG from a given netlist, we start from the primary outputs and process each gate in a reverse topological order. At each primary output, depending on its gate type, we attempt either direct backward implication or XOR propagation. Multiple-fanout nodes, or nodes where backward propagation stops, are treated as new GISG roots, and the propagation process continues. This procedure stops when all primary inputs are reached. After the extraction, the network is uniquely partitioned into AND, OR, and XOR supergates with inverters and buffers at their pins.
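One round of direct backward implication from a single root can be sketched as follows. This is a simplified illustration under several assumptions that are not taken from the paper: the netlist is a dictionary from gate name to `(type, fanins)`, only AND/OR-type gates, inverters, and buffers are handled (XOR propagation is omitted), the root is an AND/OR-type gate, and names such as `backward_imply` and `fanout_count` are hypothetical.

```python
# gate type -> (output noncontrolled value, input noncontrolling value)
NONCONTROLLED = {"AND": (1, 1), "NAND": (0, 1), "OR": (0, 0), "NOR": (1, 0)}

def backward_imply(netlist, fanout_count, root):
    """Return {(gate, pin_index): implied value} for the in-pins reached from root."""
    out_val, _ = NONCONTROLLED[netlist[root][0]]
    implied = {}
    stack = [(root, out_val)]            # (gate, value assigned at its out-pin)
    while stack:
        gate, value = stack.pop()
        gtype, fanins = netlist[gate]
        if gtype == "INV":
            in_val = 1 - value           # inverters always propagate, with inversion
        elif gtype == "BUF":
            in_val = value
        elif gtype in NONCONTROLLED and NONCONTROLLED[gtype][0] == value:
            in_val = NONCONTROLLED[gtype][1]
        else:
            continue                     # implication stops: this gate is a new root
        for idx, src in enumerate(fanins):
            implied[(gate, idx)] = in_val
            # Continue only through single-fanout internal gates.
            if src in netlist and fanout_count.get(src, 1) == 1:
                stack.append((src, in_val))
    return implied

# Example: f = AND(a, INV(OR(b, c))); names not in the netlist are primary inputs.
netlist = {"f": ("AND", ["a", "n1"]), "n1": ("INV", ["n2"]), "n2": ("OR", ["b", "c"])}
pins = backward_imply(netlist, {}, "f")
```

In this example the implication starting with a 1 at `f` reaches every in-pin: both in-pins of `f` receive 1, the inverter input receives 0, and both in-pins of the OR gate receive 0, so the three gates would fall into one supergate rooted at `f`.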
Definition 8: In a GISG network, each gate represents a root in the extracted network which contains all nodes that are covered by the same round of backward propagation originating from this root. A GISG is trivial if it covers only one gate. The type of a GISG is the same as the type of its root.
During GISG extraction, redundancy can often be easily found. We show two cases in the following.
• Case 1: backward implication conflicts at a fanout stem (see Fig. 4 ) In this case, we can write the following propositions:
so, and . That is, the value of f is independent of the value of . This means the s-a-fault at g is untestable, and hence is redundant.
• Case 2: backward implication does not conflict at a fanout stem . That is, . So, one of the fanout stems of is s-a-1 untestable and hence redundant (see Fig. 5).
IV. SWAPPABLE PINS
Definition 9: Let $p_1$ be an in-pin of gate $g_1$ and $p_2$ be an in-pin of gate $g_2$ in a mapped Boolean network $N$. Assume that the out-pin of gate $h_1$ connects to $p_1$ and the out-pin of gate $h_2$ connects to $p_2$. If connecting $h_1$ to $p_2$ and $h_2$ to $p_1$ does not change the functionality of $N$, then $p_1$ and $p_2$ are noninverting swappable. If connecting $h_1$ through an inverter to $p_2$ and $h_2$ through an inverter to $p_1$ does not change the functionality of $N$, then $p_1$ and $p_2$ are inverting swappable.
The notion of noninverting and inverting swappable pins corresponds to NES and ES. When there is no ambiguity, we use "swappable" to denote both noninverting swappable and inverting-swappable.
A. Identification of Swappable Pins
The main purpose of the GISG extraction is to explore the functional symmetry inside a network. Equipped with the information of functional symmetry, we can find wires that can be exchanged without changing the functionality of the network.
Lemma 6: If two in-pins $p$ and $q$ are covered by the same GISG rooted at $r$, and the paths from $p$ to $r$ and from $q$ to $r$ do not properly contain each other, then $p$ and $q$ are swappable.
The reason for the nonproper containment constraint is as follows. Since the underlying structure is fanout-free, if one path properly contains the other, it implies that swapping of these two pins will create loops and hence cause malfunction. Take Fig. 3(a) as an example. Pins and are covered by the same GISG rooted at . However, the path from to is properly contained by the path from to . Swapping of these two pins will create a cycle in the circuit. In the following, we implicitly assume that target pins fulfill the nonproper containment constraint.
Let the in-pin of a gate is AND-OR-reachable from the gate .
is the value set at during direct backward implication from . Proof: By applying , , or , , the inputs to will always have value combinations such as D and D, and , or D and , because XOR gates along the path can never stop the fault effect. In either case, no fault effect can be seen at the output of the . That is, inputs to XOR-reachable gates are always both ES and NES.
. Fig. 1(a) shows an GISG rooted at . There, and
. By Lemma 7, we know and are noninverting swappable. That is, they can be swapped without introducing inverters. This is shown in Fig. 1(b) .
B. Cross-Supergate Swapping
Previous theorems show that pins that are covered by the same GISG are symmetric and, hence, swappable. Further analysis shows that groups of pins belonging to different implication supergates may also be swappable.
Definition 10: (DeMorgan transformation on an implication supergate) Let SG1 be an implication supergate with type(SG1) $\in$ {AND, OR}. We define the operator DeMorgan(SG1) as the addition of inverters to all inputs and the output of SG1.
Theorem 2: Let SG1 and SG2 be two implication supergates. fanins(SG1) and fanins(SG2) are the sets of fanins to SG1 and SG2, and |fanins(SG1)| = |fanins(SG2)|. If the outputs of SG1 and SG2 are symmetric and type(SG1), type(SG2) $\in$ {AND, OR}, then fanins(SG1) and fanins(SG2) are swappable under DeMorgan transformation of SG1 and SG2.
Proof: Without loss of generality, assume type(SG1) = AND and type(SG2) = OR. Since output(SG1) and output(SG2) are symmetric, these two pins are swappable. However, instead of swapping these two pins, we can connect fanins(SG1) to DeMorgan(SG2) and connect fanins(SG2) to DeMorgan(SG1). That is, we make type(DeMorgan(SG2)) equal to type(SG1) and type(DeMorgan(SG1)) equal to type(SG2). If both SG1 and SG2 are of the same type, no DeMorgan transform is necessary. This swapping and transformation procedure clearly preserves the functionality of the network.
The example in Fig. 6 shows the process of cross-supergate swapping.
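The argument in the proof can be checked exhaustively on a small instance. In the sketch below, which is only an illustration, SG1 is a two-input AND, SG2 is a two-input OR, and their outputs feed a hypothetical symmetric top-level gate (a NAND); the function names are assumptions introduced for this example.

```python
from itertools import product

def g_and(*xs):
    return int(all(xs))

def g_or(*xs):
    return int(any(xs))

def demorgan(gate):
    """Add inverters to all inputs and to the output of a gate."""
    return lambda *xs: 1 - gate(*(1 - x for x in xs))

def top(y1, y2):
    """Hypothetical gate whose two inputs are symmetric (a NAND)."""
    return 1 - (y1 & y2)

def original(a, b, c, d):
    # fanins(SG1) = {a, b} drive the AND supergate; fanins(SG2) = {c, d} drive the OR.
    return top(g_and(a, b), g_or(c, d))

def cross_swapped(a, b, c, d):
    # fanins(SG2) now drive DeMorgan(SG1); fanins(SG1) now drive DeMorgan(SG2).
    return top(demorgan(g_and)(c, d), demorgan(g_or)(a, b))

equivalent = all(original(*v) == cross_swapped(*v) for v in product((0, 1), repeat=4))
```

Here `demorgan(g_and)` behaves as an OR and `demorgan(g_or)` behaves as an AND, so `equivalent` evaluates to `True` for this instance.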
V. DELAY OPTIMIZATION
Two types of postplacement performance optimizations are made possible by exploiting GISGs.
• Wire length reduction: Fig. 7(a) shows a set of placed gates and two signals and coming from geometrically fixed locations. Replacement of gates , , and cannot lead to better solutions because of the connectivities from other fixed instances. However, swapping and can clearly reduce the wire length. If either net or is critical, reducing wire length directly contributes to loading reduction. Congestion may also be relieved.
• Logic level reduction: In Fig. 8 , let be the late-arriving signal. Swapping with reduces the number of logic levels the late signal has to travel and, hence, reduces the overall delay.
A. Timing Model
We assume that final routing has not been done yet, that only placement has been completed. Therefore, a net model is necessary to estimate the delay along the interconnect. We adopt the analytical model proposed in [20] . Assume all pins have known coordinates after placement. Each net is modeled as a star: the center of the star is the center of gravity of all its terminals. A net is divided into several segments: from the source to the star center and from the star center to each sink. Each segment is modeled by a lumped RC. We use the Elmore model [9] for delay calculation. Since the distance from the star center to each sink may vary, each sink may have a different delay from the source.
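The star net model with Elmore delay can be sketched as below. Several details are assumptions made for illustration rather than taken from [20]: segment lengths are Manhattan distances, each segment is lumped as a series resistance followed by its full capacitance at the far end, and `r_unit`, `c_unit`, and `pin_caps` are hypothetical parameter names.

```python
def star_net_delays(source, sinks, r_unit, c_unit, pin_caps=None):
    """Elmore delay from the source to each sink of a star-modeled net."""
    pin_caps = pin_caps or [0.0] * len(sinks)
    terminals = [source] + list(sinks)
    # Star center: center of gravity of all terminals of the net.
    cx = sum(x for x, _ in terminals) / len(terminals)
    cy = sum(y for _, y in terminals) / len(terminals)

    def length(p):
        # Assumed Manhattan distance from a terminal to the star center.
        return abs(p[0] - cx) + abs(p[1] - cy)

    # Source-to-center segment.
    r0, c0 = r_unit * length(source), c_unit * length(source)
    # Center-to-sink segments; each sink adds its pin capacitance.
    segs = [(r_unit * length(s), c_unit * length(s) + cap)
            for s, cap in zip(sinks, pin_caps)]
    downstream = c0 + sum(c for _, c in segs)   # all capacitance charged through r0
    return [r0 * downstream + r * c for r, c in segs]

delays = star_net_delays((0, 0), [(6, 0), (0, 3)], r_unit=1.0, c_unit=1.0)
```

With unit resistance and capacitance per length, the star center in this example is at (2, 1) and the two sinks see delays of 61 and 52, which illustrates why each sink may have a different delay from the source.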
We use a load-dependent model for gate delay. The delay from an input pin $p$ to the output pin of a gate $g$ is
$$d(p, g) = \tau(p, g) + \beta(p, g)\, C_L(g).$$
Here, $C_L(g)$ is the load capacitance at the output of the gate $g$, $\tau(p, g)$ is the intrinsic delay from in-pin $p$ to the out-pin of $g$, and $\beta(p, g)$ is the load-dependent coefficient. Each $\tau$ and $\beta$ has two values corresponding to the rise and fall transitions, respectively.
[1]. These logic structures are first removed from the existing placement and then placed back to available slots after resynthesis. However, there could potentially be many cell overlaps in the placement that would need to be resolved with an engineering-change-order (ECO) placer. This would inevitably introduce undesirable perturbation to the existing placement because the resynthesis is based on the timing constraints from that placement.
B. Problem Formulation and the Algorithm
We consider GISG-based logic restructuring to be best suited for a postplacement scenario since the restructuring involves only wire swapping and some inverter insertion. Our goal is to minimize the maximum arrival time among all primary outputs while limiting any perturbation of the existing placement solution.
We have observed that GISG-based rewiring for performance optimization is similar to the gate-sizing problem. To use gate sizing for performance optimization, each gate in the netlist can be sized either up or down to its logically equivalent gates from the technology library. In our case, we first perform generalized supergate extraction to get a netlist of GISGs. For each GISG, a set of swappable pins is identified. Each swap can be viewed as a different library implementation of the GISG. Thus, the problem of performance-driven GISG-based rewiring is transformed into a gate-sizing problem on the GISG netlist. Cross-supergate swapping is not considered in the current formulation since such supergates occur relatively rarely in the test suite we used.
Our algorithm is based on the gate-sizing heuristics proposed by Coudert [6]. The idea is to maximize the minimum slack through iterative neighborhood search and relaxation. Our overall algorithm is shown in Fig. 9. The function on the left is called by the function on the right as an internal routine. We first discuss the function on the left. Each gate-resizing and wire-swapping choice is viewed as a possible move for the optimization and is annotated with a value called fitness, which is the potential gain in terms of the cost function in the local neighborhood when the move is executed. The type of the cost function is specified by an external variable, which can be either "S" or "TS." When it is "S," the cost of a neighborhood is defined as the minimum slack among all nodes in the neighborhood. When it is "TS," the cost of a neighborhood is defined as the summation of slacks of all nodes in the neighborhood. While S is the direct target for delay optimization, it has been observed in [6] that TS offers a good measure for relaxation. For each GISG in the netlist, we find the best move based on the fitness value calculated over the local neighborhood. For a GISG that is trivial, we consider resizing as the set of candidate moves. After finding the best move for each GISG, the algorithm sorts all of the GISGs into a sequence with respect to their fitness values. A series of best moves is determined by traversing the sequence of moves. We do not stop at the first maximum found when traversing the sequence. Instead, we traverse the whole sequence and determine the best sequence in order to escape from local optima. This is implemented in the BestMultipleMoves(N) function. After applying the moves, the set of gates that are in the neighborhood of the perturbed gates is put into the update list as candidates for the next iteration. The function stops when convergence conditions are met: either the iteration limit has been exceeded or the improvement is lower than a given threshold.
The second function calls the first in an iterative fashion by switching the optimization goal between S and TS. In the S phase, we seek the best move which maximizes the minimum slack in its neighborhood. In the TS phase, the best move is taken to maximize the summation of all slacks in its neighborhood. The goal of this phase is to speed up the network globally and escape from a local minimum. These two phases iterate until no further improvement is possible [6].
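The two neighborhood cost functions and the idea of traversing the whole move sequence can be sketched as follows. This is a simplified illustration, not the algorithm of Fig. 9: the function names are hypothetical, and selecting the prefix of the sorted sequence with the largest cumulative gain is an assumption about how the best sequence could be chosen.

```python
def neighborhood_cost(slacks, mode):
    """Cost of a neighborhood: minimum slack ("S") or sum of slacks ("TS")."""
    if mode == "S":
        return min(slacks)
    if mode == "TS":
        return sum(slacks)
    raise ValueError("mode must be 'S' or 'TS'")

def fitness(slacks_before, slacks_after, mode):
    """Potential gain of a move, measured on its local neighborhood."""
    return neighborhood_cost(slacks_after, mode) - neighborhood_cost(slacks_before, mode)

def best_prefix(gains):
    """Scan the whole sequence and keep the prefix with the largest total gain.

    Assumed selection rule: moves are applied in order, and gains[k] is the
    gain observed when the (k+1)-th move is added.
    """
    best_len, best_total, total = 0, 0.0, 0.0
    for k, gain in enumerate(gains, start=1):
        total += gain
        if total > best_total:
            best_len, best_total = k, total
    return best_len, best_total

# A scan that stopped at the first maximum would keep one move (total gain 3);
# scanning the whole sequence keeps three moves (total gain 4).
chosen = best_prefix([3, -1, 2, -5])
```

The example shows why continuing past the first maximum can help: accepting a temporarily worse move may expose a better cumulative result later in the sequence.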
C. Experimental Results
Our prototype tool Rewiring After Placement usIng easily Detectable Symmetries (RAPIDS) has been implemented on top of SIS 1.3 [23] and tested on both the Microelectronics Center of North Carolina '91 and International Symposium on Circuits and Systems '89 benchmark suites. Sequential circuits are treated as combinational ones with all sequential elements removed. All benchmarks are optimized by SIS script.rugged and mapped by the command "map -n 1 -AFG." We use a commercial 0.35-µm standard cell library consisting of INV, BUF, NAND, NOR, XOR, and XNOR with the number of inputs ranging from 2 to 4. Each type has four different implementations. The mapped netlist is fed to a commercial timing-driven placer. We set the required time at primary outputs by taking 80% of the preplacement arrival time. This figure is used as the timing constraint to the placer. Cell locations are extracted after placement. To model interconnect, we use 2 pF/cm for unit capacitance and 2.4 cm for unit resistance. All experiments are performed on a Sun Ultra10 with 128 MB of memory. We do not perform cross-supergate swapping in our experiment. Also, we do not utilize the redundancies found during supergate extraction.
To evaluate the effect of using GISG-based rewiring for delay optimization, three algorithms have been implemented:
• Use only GISG-based rewiring;
• Use only gate sizing [6];
• For gates covered by nontrivial GISGs, use GISG-based rewiring. Otherwise, consider gate sizing for that gate. To couple these two choices tightly, our algorithm works on a netlist of extracted GISGs. Each possible swap in a GISG can be viewed as an electrically different instance of a functionally equivalent implementation from a virtual library for this GISG.
Table II shows the results of our experiments. Column 3 shows the initial critical path delay after placement. Columns 4-6 show the delay improvement by the rewiring-only, sizing-only, and combined algorithms, respectively. Columns 7-9 show the runtime (in seconds) for these three algorithms. Columns 10 and 11 show the percentage of increase/decrease in the area. We consider only the area taken by gates in the netlist. Column 12 shows the percentage of gates covered by nontrivial GISGs. On average, 27.6% of gates are covered. Column 13 shows the largest number of inputs among all GISGs in the netlist. In one benchmark, a GISG with 43 inputs exists. Column 14 shows the number of redundancies found during GISG extraction. The last column shows the number of symmetries found in each benchmark circuit, a considerable number of them. The ability to explore such flexibility suggests a huge potential for optimization. The results show that GISG-based rewiring and gate sizing complement each other. With the combined algorithm, the total improvement is often larger than the sum of the individual improvements. Our explanation is that rewiring alone or gate sizing alone may easily get stuck in local optima because critical paths often conflict with each other. On the other hand, rewiring and gate sizing together can help extricate each other from local optima by exploring a much larger solution space.
The results also show that the area is often reduced after either gate sizing or the combined algorithm. For most benchmarks, the combined algorithm achieves better delay improvement than gate sizing alone while reducing the area more. This further confirms our approach of sizing only gates covered by trivial GISGs. We assume that such minimal area perturbation can be easily resolved by an ECO placer. Also, all benchmark runs finish within three minutes of CPU time.
Although in our experiment we resized only gates covered by trivial supergates, we can also resize gates covered by nontrivial supergates. This resizing can potentially enlarge the solution space for better performance, a potential that is shaping the direction of our future study.
VI. DELAY-CONSTRAINED POWER OPTIMIZATION
In this section, we present a delay-constrained power optimization algorithm using functional symmetries. Some power estimation techniques are reviewed first.
A. Power Model
The average power dissipation in a CMOS gate consists of three major factors:

$$P_{avg} = P_{switching} + P_{short\text{-}circuit} + P_{leakage}.$$

The first term is the power consumed when charging and discharging the output load of the gate. It depends on the output loading capacitance and the toggle rate (number of transitions per time unit). The second term indicates the short circuit current during the CMOS gate's switching. It depends on the input transition time, internal load, and the toggle rate. The last term is the power consumed due to the device leakage current. Since $P_{short\text{-}circuit}$ and $P_{leakage}$ are more device-related, we need only consider the optimization of $P_{switching}$, which is the dominating factor of $P_{avg}$ [17].
The toggle rate depends on the relative delays of signals propagating through the circuit. A gate can undergo a series of transitions before settling into a steady state. However, it is computationally very expensive to determine this effect, as it involves an event-driven simulator with all timing information taken into consideration [17] . In order to use the estimation as a subroutine inside our algorithm, we choose to neglect the effect of glitching and use a zero-delay model instead.
Najm has introduced the notion of equilibrium probability and transition density for power estimation [18]. The equilibrium probability of a signal $x$, denoted by $P(x)$, is the fraction of time when $x$ is evaluated to logic 1. The transition density of $x$, denoted by $D(x)$, is the average number of transitions per unit time. Under spatial and temporal independency assumptions, an efficient algorithm was introduced to propagate the density values from the primary inputs throughout the circuit. To see how the propagation algorithm works, recall the concept of Boolean difference: if $y$ is a Boolean function that depends on $x$, then the Boolean difference of $y$ with respect to $x$ is defined as

$$\frac{\partial y}{\partial x} = y|_{x=1} \oplus y|_{x=0}.$$

Here, $\oplus$ represents the Boolean exclusive-or function. The Boolean difference is the XOR of the positive and negative cofactors with respect to $x$. Essentially, $\partial y / \partial x$ is the condition that if there is a transition at $x$, there is a corresponding transition at $y$. For example, let $y$ be a two-input AND gate (i.e., $y = x_1 x_2$). The Boolean difference of $y$ with respect to $x_1$ is $x_2$. So, when $x_2 = 1$, any transition at $x_1$ will cause a corresponding transition at $y$. It has been shown in [18] that under the spatial independency assumption, the transition density at the output $y$ of an $n$-input function can be calculated by the following:

$$D(y) = \sum_{i=1}^{n} P\!\left(\frac{\partial y}{\partial x_i}\right) D(x_i).$$

Intuitively, $D(y)$ is the summation of each of the inputs' transition densities multiplied by the probability of setting other side inputs for the propagation of the transition. The overall power consumption estimation under this measure is then

$$P_{avg} = \frac{1}{2} V_{dd}^{2} \sum_{i=1}^{N} C_i D(x_i)$$

where $V_{dd}$ is the supply voltage, $C_i$ is the load capacitance seen from node $i$, and $N$ is the total number of nodes in the circuit.
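The density propagation rule can be exercised on a single gate with a short Python sketch. It enumerates the truth table of the gate to obtain the probability of each Boolean difference under the spatial independence assumption, then applies the density formula and the switching-power estimate. The function names, the list-based interface, and the numeric values are illustrative assumptions and are not taken from the original work.

```python
from itertools import product


def boolean_difference_prob(func, n, i, probs):
    """Probability that toggling input i toggles the output of func.

    Assumes spatially independent inputs with probs[k] = P(x_k = 1).
    """
    others = [k for k in range(n) if k != i]
    total = 0.0
    for bits in product((0, 1), repeat=n - 1):
        assign = [0] * n
        weight = 1.0
        for k, b in zip(others, bits):
            assign[k] = b
            weight *= probs[k] if b else 1.0 - probs[k]
        assign[i] = 0
        low = func(assign)
        assign[i] = 1
        high = func(assign)
        if low != high:  # XOR of the two cofactors evaluates to 1
            total += weight
    return total


def transition_density(func, n, probs, dens):
    """Output density: sum over inputs of P(dy/dx_i) * D(x_i)."""
    return sum(
        boolean_difference_prob(func, n, i, probs) * dens[i] for i in range(n)
    )


def switching_power(vdd, caps, densities):
    """Zero-delay estimate: 0.5 * Vdd^2 * sum of C_i * D(x_i)."""
    return 0.5 * vdd ** 2 * sum(c * d for c, d in zip(caps, densities))


def and2(x):
    return x[0] & x[1]


# Two-input AND: D(y) = P(x2) * D(x1) + P(x1) * D(x2)
d_out = transition_density(and2, 2, probs=[0.5, 0.25], dens=[2.0, 4.0])
```

For the AND gate in the example, the result agrees with the hand calculation $0.25 \cdot 2 + 0.5 \cdot 4 = 2.5$.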
B. Property of Functional Symmetry
We now analyze the effect of wire swapping on transition density.
Theorem 3: Let $f$ be a function defined over support set $X$ and be of NES (ES) with respect to variables $x_i$, $x_j$. That is, $x_i$ and $x_j$ can be swapped without (with) adding inverters. Let $D'(f)$ be the transition density at $f$ after swapping $x_i$ and $x_j$. Then, the transition density after the swap will equal the transition density before the swap. That is, $D'(f) = D(f)$.
Proof: Without loss of generality, we assume $f$ is of NES with respect to variables $x_i$, $x_j$. That is,

$$f_{x_i \bar{x}_j} = f_{\bar{x}_i x_j}. \qquad (4)$$

The case for ES can be proved similarly. The swap is illustrated in Fig. 10.
By definition, the transition density of before swap is For simplicity, we denote as the last two terms of .
That is
The new transition density of after the swap is 
Shannon Expansion)
By assumption, and are of NES, so (4) holds. Plugging (4) into (5) and (6), we obtain . This result once more proves the equivalence between and . Finally, we conclude that the new transition density after the swap is the same as the one before the swap.
The importance of Theorem 3 is twofold. First, it provides the theoretical foundation for the effect of symmetric swapping on transition density. Changes in transition density are guaranteed to be bounded inside the associated implication supergate. Second, at each of the GISG roots, transition densities serve as a set of fixed points throughout the optimization. As a result, our algorithm can take a global view of the whole optimization process. The detailed algorithm will be discussed in the next section.
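A small numeric check makes the two consequences of Theorem 3 concrete. The sketch below assumes a three-input AND supergate built from two 2-input AND gates, with independent inputs described by a (probability, density) pair; the structure, signal values, and helper names are illustrative assumptions. Swapping two symmetric wires changes the density at the internal gate but leaves the density at the supergate root unchanged.

```python
def and2(sig1, sig2):
    """Propagate (probability, density) through a two-input AND gate.

    The Boolean difference with respect to one input is the other input,
    so each input density is weighted by the other input's probability.
    """
    p1, d1 = sig1
    p2, d2 = sig2
    return p1 * p2, p2 * d1 + p1 * d2


# (equilibrium probability, transition density) of three independent inputs
a = (0.5, 1.0)
b = (0.25, 2.0)
c = (0.75, 4.0)

# Before the swap: root = AND(AND(a, b), c)
inner_before = and2(a, b)
root_before = and2(inner_before, c)

# After swapping the symmetric wires b and c: root = AND(AND(a, c), b)
inner_after = and2(a, c)
root_after = and2(inner_after, b)
```

With these values the internal density moves from 1.25 to 2.75, while the root density stays at 1.4375 in both configurations. This is why a swap can lower the density on a heavily loaded internal net without disturbing anything outside the supergate.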
C. Algorithm
The algorithm for delay-constrained power optimization is an extension of the algorithm developed for delay optimization. Wire swapping contributions to delay-constrained power optimization are as follows.
1) The transition density of gates covered by the same GISG can potentially be changed. Thus, it is beneficial to lower the transition density at gates with high loading by wire swapping.
2) Gate resizing lowers the power consumption at the risk of delay penalty. When the allowable delay penalty is reached, no further power reduction is possible. The solution space for tradeoff can be enlarged by covering the delay losses with wire swapping, which has been shown to be good for delay optimization.
For GISGs that are nontrivial, we consider each swap as a possible move. For a trivial GISG, implementations of this gate from the technology library form the set of possible moves. Hence, a move in our algorithm can be either gate resizing or wire swapping.
Now, we analyze the effect for each type of move. Basically, the resizing will affect the slack and power of the circuit under optimization. In [6], Coudert observed that the effect on slack tends to be confined within the local neighborhood of the move. Here, we concentrate on the effect of a swap. The change in slack can be calculated by updating the arrival/required time in the local neighborhood. The change in power consumption comes from two sources: 1) the loading capacitances of the swapped pins are changed and 2) the transition densities of the fanouts of the swapped pins are changed up to the root of the supergate. This effect can be efficiently calculated by an event-driven procedure.
We adopted an approach based on a benefit/penalty function for the delay-constrained power optimization problem by defining the fitness function of each move as follows:
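$$\text{Fitness} = \begin{cases} 0 & \text{if both } \Delta_{slack} \text{ and } \Delta_{power} \text{ become worse,} \\ \text{a gain depending on } \Delta_{slack} \text{ and } \Delta_{power} & \text{otherwise,} \end{cases}$$

where $\Delta_{slack}$ is the change in minimum slack in the local neighborhood, $\Delta_{power}$ is the change of power consumption of the whole circuit, and the gain is parameterized by two predefined constants. A move is assigned zero fitness value (gain) if the move causes both the slack and power to become worse. Moves with zero fitness are immediately discarded. Otherwise, the gain is defined as a function depending on both $\Delta_{slack}$ and $\Delta_{power}$. In general, we want to choose a move that trades as little slack for as much power reduction as possible. A move can be either a gate resizing or a wire swapping. They are distinguished only by their fitness values.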
The overall algorithm is similar to the one used for delay optimization shown in Fig. 9. The cost of the network is redefined as the primary optimization goal, power consumption. The fitness function defined above is used for delay-constrained power optimization.
D. Experimental Results
The experimental setting is the same as the one used for delay optimization. Table III shows the results. The first column lists the name of each benchmark. Column 2 shows the number of gates in the mapped netlist. To have a fair comparison with the gate-sizing-only technique, we preprocess the circuit by minimizing the critical path delay, using only gate sizing. Columns 3-5 show the corresponding delay, power, and area after timing optimization. Columns 6 and 7 show the corresponding power reduction for the gate-sizing-only approach and for our hybrid approach when the delay constraint is set at 5% worse than that of the preprocessed circuit. To demonstrate the result from another angle, we set the power constraint to be 10% less than that of the preprocessed circuit and show the delay tradeoff. The results for both approaches are shown in Columns 8 and 9, respectively. We also show the percentage of area perturbation and CPU time (in seconds) when deriving the power-delay tradeoff curve from Columns 10 to 13. That is, we show the runtime from the preprocessed circuit with the best possible slack compared with the circuit with the best possible power reduction. The area perturbation is reported for the fixed delay experiment.
The results clearly show the benefits of using functional symmetry together with gate sizing for postplacement power-delay tradeoff. In all benchmark runs, the hybrid approach has achieved better power reduction with less delay penalty. For example, in benchmark C6288, the gate-sizing-only approach reduces power by 7.1% at a 5% delay penalty. At the same delay penalty, the hybrid approach reaches as much as 25.5% reduction in power consumption. On the other hand, delay penalties reach 7.0% and 1.2%, respectively, for the gate-sizing-only and the hybrid approach when the same benchmark is reduced to 90% of its original power consumption. This shows the great potential for applying our approach to trade lower delay penalty for better power reduction. On average, at 5% delay penalty, our hybrid approach achieves 12.6% power reduction, as compared with 8.3% for the gate-sizing-only approach. At 10% power reduction, we trade off only 1.4% of delay, whereas the gate-sizing-only approach incurs a 5.7% delay penalty. In our experiment, we considered only trivial supergates for resizing. Further improvements are still possible by relaxing this constraint.
Our approach can potentially explore a much larger solution space than can be obtained by the gate-sizing-only approach. This can be seen in the power-delay tradeoff curves in Fig. 11 . It is easily seen from the curves that our hybrid approach can quickly reach a significant power reduction while incurring only a very small delay penalty. In Fig. 11(d) , the processed benchmark alu2 has power level at 2104 and delay at 9.07. Our hybrid approach immediately finds a solution with a delay of 8.95, while consuming the same amount of power. This shows that because we have a much larger solution space, we can have much more freedom to trade less delay for more power reduction.
VII. DELAY-CONSTRAINED RELIABILITY OPTIMIZATION
HCEs have been studied extensively in the past few decades [10] , [13] , [27] . Efficient techniques for accurate transistor-level reliability simulations have been implemented in both academic [25] and commercial tools [3] . However, transistor-level simulations of large industrial circuits are computationally too expensive to be feasible. A probabilistic approach was proposed in [14] to estimate the degradation effects on timing. Recently, a ratio-based gate-level degradation model was proposed in [28] as a higher level abstraction. Each cell from the technology library is precharacterized for its degradation behavior under various stress conditions. For an excellent review, refer to [13] .
Design-for-reliability (DfR) techniques considering HCEs fall into two categories. One category includes such techniques as transistor reordering and resizing [7] , technology mapping [5] , and technology-independent factorization [22] to minimize the maximum hot-carrier degradation effects among all transistors in the circuit. That is, each transistor in the circuit is labeled with a relative degradation factor and the optimization goal is to minimize the maximum of these factors. This goal targets improvement of the mean-time-to-failure (MTTF) under the assumption that if any device in the circuit fails, the whole circuit fails. The other category of techniques includes the method proposed by Li et al., which performs input pin reordering and gate resizing [15] to minimize the impact of performance degradation on the entire circuit. The idea is that not all devices in the circuit are of equal importance as far as overall performance is concerned. Devices not on the critical paths can potentially tolerate more degradation without affecting the overall performance.
However, all of these techniques operate at the gate/transistor level without knowledge of the physical layout information, which has a tremendous impact on device degradation. For example, input slew rate to a transistor and effective output switching are identified as the most important factors [12] , [15] determining device degradation. Due to the resistive behavior of deep submicron interconnects, estimation of slew rates at the gate level is very inaccurate when the placement and routing information is unknown. Also, because of the underlying Boolean functionality, some gates experience more switching than others. Switching activity cannot be controlled for optimization purposes without changing the logic structure of the circuit.
A. Reliability Model
In this section, we first review the ratio-based degradation model from [26] and [28] .
Let $d_{fresh}$ and $d_{aged}$ be the fresh and aged pin-to-pin signal delays, and let $r$ be the aged-to-fresh signal delay ratio, which characterizes the overall pin-to-pin delay degradation of a gate due to the HCE. These variables are defined for each transition type (rise or fall) in each signal path of the logic gates. The relationship between $d_{fresh}$, $d_{aged}$, and $r$ is

$$d_{aged} = r \cdot d_{fresh}. \qquad (7)$$

$r$ is a value larger than one and can be characterized by the following equation:

$$r = \sum_{i=1}^{n} r_i - (n - 1) \qquad (8)$$

where $n$ is the number of transistors in series and $r_i$ is the aged-to-fresh delay ratio when only pin $i$ is under stress. It is defined as follows:

$$r_i = F(s_i, C_L, N_i). \qquad (9)$$

In this equation, $s_i$ is the slew rate of the input pin, $C_L$ is the load capacitance of the gate output, and $N_i$ is the number of effective switchings of the input pin. By "effective," we mean that the input-pin switching leads to an output-pin switching. We can view $r_i$ as a degradation factor of the $i$th transistor in series in the conducting path. Conceptually, slower slew rates can put transistors in undesirable bias conditions for longer periods of time, larger load capacitances can stress the transistor longer during charging and discharging, and more effective switching can stress the transistors more often. These stresses cause transistors to wear out more quickly. Function $F$ is determined in the process of transistor-level simulation. The results are used to build a three-dimensional table for later reference.

Equation (8) deserves further explanation. Essentially, it decouples/simplifies the effect of degradation on the conducting path into individual contributions from transistors along the path. For example, consider the high-to-low delay of a two-input NAND gate in Fig. 12(a). As a first-order simplification, transistors in series are regarded as resistors in series in Fig. 12(b). When only M1 switches in the whole lifetime, $R_1$ degrades to $R_1'$ with $R_1' > R_1$ and the delay degradation ratio can be written as follows:

$$r_1 = \frac{R_1' + R_2}{R_1 + R_2}.$$

Fig. 12(c) shows the effect. Similarly, if only transistor M2 switches over the whole lifetime as in Fig. 12(d), $r_2$ can be written as follows:

$$r_2 = \frac{R_1 + R_2'}{R_1 + R_2}.$$

Now, after characterizing degradation of each of the individual transistors, the effect of both degraded transistors M1 and M2 along the conducting path can be added [see Fig. 12(e)] and the resulting $r$ is as follows:

$$r = \frac{R_1' + R_2'}{R_1 + R_2} = r_1 + r_2 - 1.$$

This equation is for the case when $n$ equals two. When the number of transistors in series is $n$, this equation can be generalized, yielding (8). Based on this extended pin-to-pin delay model, full chip timing/reliability simulation is demonstrated to be 2 to 4 orders of magnitude faster [28], while accuracy is within 1% of the transistor-level counterpart.
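The series-resistor argument can be checked numerically. The sketch below assumes that the combined aged-to-fresh ratio for n stressed pins is the sum of the single-pin ratios minus (n - 1), which is our reading of the additive derivation above; the function names and resistance values are hypothetical and chosen only for illustration.

```python
def combined_ratio(single_pin_ratios):
    """Combine single-pin aged-to-fresh ratios for transistors in series.

    Assumed closed form: sum of the ratios minus (n - 1), so that n
    undegraded pins (all ratios equal to 1) give a combined ratio of 1.
    """
    n = len(single_pin_ratios)
    return sum(single_pin_ratios) - (n - 1)


def aged_delay(fresh_delay, single_pin_ratios):
    """Aged pin-to-pin delay as ratio times fresh delay."""
    return combined_ratio(single_pin_ratios) * fresh_delay


# First-order resistor analogy for the pull-down path of a two-input NAND.
R1, R2 = 1.0, 3.0            # fresh on-resistances (hypothetical values)
R1_aged, R2_aged = 1.5, 4.0  # degraded on-resistances (hypothetical values)

r1 = (R1_aged + R2) / (R1 + R2)            # only M1 degraded
r2 = (R1 + R2_aged) / (R1 + R2)            # only M2 degraded
r_both = (R1_aged + R2_aged) / (R1 + R2)   # both degraded
r_model = combined_ratio([r1, r2])
```

Here the directly computed ratio `r_both` and the decoupled estimate `r_model` both equal 1.375, which is the property that lets each pin be characterized separately in the lookup table.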
B. Problem Formulation
Even though large-scale, fast-yet-accurate reliability simulation is feasible, there are no systematic ways to correct the timing degradation found. In [12], design-for-reliability considerations at circuit level are given as a set of guidelines. Here, we move one step further by considering the design-for-reliability issues at the logic level guided by accurate layout information. Our main argument is that circuit-level consideration by itself is not adequate. From (9), it is clear that degradation is affected by three variables: namely, the input slew rate, the output load capacitance, and the number of effective switchings. In a digital circuit, these three parameters could potentially vary dramatically among different gates. In other words, the stress is uneven. Some transistors wear out faster than others. For example, because of the underlying Boolean logic, some gates in the circuit switch much more frequently than others, and some gates bear larger loads than others. Circuit-level techniques simply cannot address all of these phenomena.
We now discuss the effect of gate sizing, pin reordering, and rewiring on HCE and circuit performance. To improve the slew rate of an input pin, the driver of the pin can be sized up for better driving capability. However, a larger driver causes larger loading on the preceding stage, which needs to be taken into account in the tradeoff. Rewiring can be used either to reduce the loading of the driver by minimizing the interconnect loading, or simply to replace the driver by another which has better driving capability or which is physically closer to the sink pin, to reduce the interconnect length. The effective switching can be changed only by rewiring the netlist. Another HCE effect concerning the ordering of transistors has been observed in [12] and [15]. For example, in Fig. 13(a), the top nMOS transistors that are directly connected to the output node have the potential of experiencing the most damage if they switch last. This is because the stress on an nMOS transistor is directly related to $V_{ds}$ (drain-to-source voltage difference). Suppose there is an effective transition on one input pin (the other inputs have already arrived). When the transistor driven by this pin connects to the output node, $V_{ds}$ is larger than in the case when the transistor is closer to the ground. This effect is due to the charge redistribution on internal nodes. In Fig. 13(b), when the switching transistor is connected close to the ground and the other two transistors are conducting, the charge stored at the output is redistributed to the two internal parasitics. This effectively lowers the $V_{ds}$-induced HCE damage on the switching transistor. However, conventional timing optimization techniques tend to put the last arriving signal closer to the output node to minimize the overall arrival time. This tradeoff also needs to be considered during optimization.
Let $\tau_{fresh}(C)$ and $\tau_{aged}(C)$ be the nondegraded and degraded critical path delays of a design $C$. We formulate the following problem:
Fresh Delay Constrained Aged Delay Optimization Problem (FDCADOP): Instance: We assume that we are given a placed and routed standard-cell design $C$ and a hot-carrier degradation, precharacterized, standard cell library $L$. Let $S$ be the family of the sets of pins that are identified as functionally symmetric and which can be swapped without changing the overall functionality of the design $C$.
Configuration: Each gate can be resized by a functionally equivalent though electrically different cell from $L$. Each functionally symmetric pin pair can be swapped to change the logic structure of $C$. We consider pin reordering to be a special case of functional symmetry.
Optimization: Let $C'$ be the new design after gate resizing and pin swapping from the original design $C$. The goal is to find a $C'$ that satisfies the following requirements:

$$\text{minimize } \tau_{aged}(C') \quad \text{subject to } \tau_{fresh}(C') \le \tau_{fresh}(C).$$

Essentially, we want to minimize the performance degradation of the aged design by redistributing the stress to logic elements that are not on the timing-critical path. However, we are not willing to sacrifice any performance in the fresh design. We observe that simply optimizing $\tau_{fresh}$ (the traditional delay optimization goal) does not necessarily lead to an optimized solution of $\tau_{aged}$. This is because the optimized $\tau_{fresh}$ might place unfavorable stress conditions on the transistors along the critical path. As discussed in the previous section, various tradeoffs have to be considered simultaneously, and we need a unified algorithm that takes into account both the performance requirement and the aging effect to resolve this situation. Our algorithm will be detailed in the next section.
C. Algorithm
To solve the FDCADO problem, we adopt a probabilistic approach based on [18] to estimate the effective switching of each pin of the gate. Under the spatial and temporal independency assumptions, the effective switching $N_i$ for a pin $i$ of gate $g$ under the zero-delay model can be expressed as

$$N_i = D(x_i) \cdot P\!\left(\frac{\partial g}{\partial x_i}\right) \cdot T$$

where $D(x_i)$ is the number of transitions per unit time of the input pin $i$, $P(\partial g / \partial x_i)$ is the probability of propagating a transition from pin $i$ to the output of gate $g$, and $T$ is the total time period. We assume that half of the transitions are low-to-high and half are high-to-low.
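As a minimal sketch of this estimate, the Python function below multiplies the three quantities named above and then splits the total evenly between the two transition types. The function name, the dictionary keys, and the numeric values are assumptions made for illustration only.

```python
def effective_switchings(density, propagate_prob, period):
    """Estimate effective switchings of one input pin under a zero-delay model.

    density        -- transitions per unit time at the input pin
    propagate_prob -- probability that a pin transition reaches the gate output
    period         -- total time period, in the same time unit as density
    """
    total = density * propagate_prob * period
    # Assumption from the text: half of the transitions are low-to-high
    # and half are high-to-low.
    return {"total": total, "rise": total / 2.0, "fall": total / 2.0}


# Pin of a two-input NAND: a transition propagates when the other input is 1,
# so the propagation probability equals the other input's equilibrium probability.
counts = effective_switchings(density=4.0, propagate_prob=0.25, period=100.0)
```

The per-transition-type counts are the values that would be used to index the precharacterized degradation table together with slew rate and load capacitance.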
The algorithm is very similar to what we have used for delay-constrained power optimization. We change the fitness function to:
$$\text{Fitness} = \begin{cases} 0 & \text{if both } \Delta s_{fresh} < 0 \text{ and } \Delta s_{aged} < 0, \\ \text{a gain with exponential dependency on } \Delta s_{fresh} \text{ and } \Delta s_{aged} & \text{otherwise,} \end{cases}$$

where $\Delta s_{fresh}$ and $\Delta s_{aged}$ are the changes of minimum slack of the fresh and aged circuits caused by the move in the local neighborhood, defined as gates within a user-specified level limit from the source of the move. The upper part of the fitness function defines the situation when the move degrades both the fresh and aged circuits. This undesirable kind of move is assigned zero fitness value, which means it will never be executed. On the other hand, the exponential dependency on the slack changes gives priority to moves which accelerate both the fresh and aged circuits. Typically, the constant weighting the fresh slack change is chosen to be much larger than the one weighting the aged slack change, to penalize the situation in which the fresh delay is degraded while the aged delay is improved. When designing our algorithm, we intentionally made no assumption about the property of the degradation function in (9). That is, no matter how this function is characterized, whether by an analytical equation, an empirical formula, or simply a table lookup, our algorithm still applies. This further demonstrates the robustness of our approach.
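The qualitative behavior described above can be illustrated with a short Python sketch. Only the zero branch and the exponential dependency come from the text; the exact exponent, a weighted sum of the two slack changes, and the default weights are assumptions introduced here for illustration and should not be read as the published formula.

```python
import math


def aged_fitness(d_slack_fresh, d_slack_aged, w_fresh=10.0, w_aged=1.0):
    """Hypothetical fitness of a move for fresh-delay-constrained aging optimization.

    d_slack_fresh -- change of minimum slack in the fresh circuit (positive is better)
    d_slack_aged  -- change of minimum slack in the aged circuit (positive is better)
    w_fresh, w_aged -- assumed weights, with w_fresh much larger than w_aged
    """
    if d_slack_fresh < 0 and d_slack_aged < 0:
        return 0.0  # the move degrades both circuits and is never executed
    # Assumed form: exponential in a weighted sum of the slack changes.
    return math.exp(w_fresh * d_slack_fresh + w_aged * d_slack_aged)


both_better = aged_fitness(0.1, 0.1)
fresh_worse = aged_fitness(-0.1, 0.1)
both_worse = aged_fitness(-0.1, -0.1)
```

With the larger weight on the fresh slack change, a move that helps the aged circuit at the expense of the fresh one scores below a neutral move, while a move that helps both scores highest.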
D. Experimental Results
The experimental setting is the same as that in delay and delay-constrained power optimization. To characterize the cell library with aging information, we use the transistor-level aging simulator BERT [25] together with HSPICE and verify it with analytical equations obtained from [13] for a ten-year period. We characterize the aging effect only on NMOS transistors since the degradation of PMOS transistors is relatively negligible [13] for the technology we are using.
Three algorithms have been implemented to show their relative strength in optimizing the aged circuit under fresh delay constraint: 1) pin reordering; 2) gate sizing; and 3) a hybrid approach discussed in the previous section. Experimental results are shown in Table IV . The first column lists the name of each benchmark. The second and third columns show the fresh and aged delays of the original circuit after placement. This fresh delay is used as the timing constraint for the aged circuit optimization. Column 4 is the percentage of performance degradation due to circuit aging. Columns 5 and 6 show the aged delays after pin reordering and the corresponding degradation percentages as compared with the original fresh delay. Columns 7 and 8 show the aged delays after gate sizing and the corresponding degradation percentages. Column 9 shows the aged delays after our hybrid approach, and column 10 gives the degradation percentages compared with the original fresh delay. Columns 11-13 show the CPU times in seconds for pin reordering, gate sizing, and our approach, respectively.
The results clearly show the advantage of considering logic restructuring in combination with traditional techniques. On average, the percentage of degradation can be lowered to within 1% of the original fresh delay using our technique, whereas pin reordering and gate sizing result in degradation of 8.8% and 4.6%, respectively. The percentage of area change caused by gate sizing is within 3% and is assumed to be amenable to corrections by an ECO placer.
VIII. CONCLUSION AND FUTURE WORK
Combining the theory of functional symmetry, ATPG, and supergates, we have developed a unified framework for symmetry identification in Boolean networks. Application for postplacement delay optimization has also been demonstrated. On average, the generalized gate sizing proposed here achieves 9% timing improvement at a very low computational cost and minimum perturbation of the existing placement solution.
Postplacement delay-constrained power optimization is also studied. Theoretical results on the use of functional symmetry and its effect on transition density are formally stated. With the GISG roots serving as fixed transition density points during the logic restructuring, we have developed a restructuring approach that takes a much more global view than existing greedy restructuring approaches. Our technique can be distinguished from the existing techniques in several aspects. 1) Instead of trying to globally change the transition density of the circuit, it keeps a set of fixed transition density points. This enables wire swapping to cover the delay losses when optimizing for power in a global fashion. 2) Performing optimization at postplacement stage allows us to accurately model the interconnect-induced delay and carefully trade it for power.
Even though we use transition density based on [18] as our primary means for power estimation, our approach is not limited to it. Other estimation techniques, such as simulation or symbolic techniques, can be used for better accuracy. Experimental results show that our technique achieves much better power-delay tradeoff when compared with the gate-sizing-only approach. At the postlayout stage, trading as little delay penalty as possible for large power reduction is very important, as any delay penalty might lead to failure to meet the performance target. A timing optimizer directly targeting the circuit-aging behavior is also proposed. Combining functional-symmetry-based rewiring, pin reordering, and gate sizing, our approach shows much better results than the individual traditional approaches. On average, we can minimize the impact of circuit aging to within one percent of the original design specification.
In [11], a combined buffer insertion and redundancy-addition-and-removal (RAR) technique is proposed for postlayout performance optimization. Supergate-based rewiring, gate sizing, RAR, and buffer insertion can naturally be integrated to form a powerful back-end optimization flow with minimum perturbation on the current placement solution. As designs migrate to the deep submicron technologies, the ability to perform incremental logic restructuring after placement becomes extremely important. Our integrated technique shows great promise for solving the timing closure problem.
