Abstract-We describe a placement-level decoupling capacitance (decap) insertion technique whose objective is to reduce power noise, taking into account circuit timing. Our approach consists of prediction and correction steps. Before placement, we estimate the power noise of each cell considering switching frequency of cells that, after placement, will most likely be in the neighborhood. If a frequently switching cell has neighbors that switch infrequently, it is unlikely that this cell will suffer from a power-noise problem. Based on the cell power-noise estimation, we add decap padding to each cell. Then, we invoke a standard cell placement tool and perform power grid analysis. We eliminate the power grid noise by gate sizing. Our technique can allocate decaps to improve power noise, power consumption, and timing. We propose two gate-sizing algorithms. The first one uses a sequence of linear programs (SLP) formulation, and the second one uses a budgeting-based heuristic algorithm. The SLP algorithm can produce better power-noise results than the heuristic, at the expense of runtime. Experimental results show that our techniques can effectively reduce power noise and still meet timing constraints.
I. INTRODUCTION

MODERN designs manufactured in advanced technologies are very sensitive to power noise. Aggressive technology scaling increases average current density and power-noise magnitude. Reduced supply voltage causes power voltage drop to consume an increased portion of the ideal voltage supply level, which affects timing of CMOS gates. It is therefore important to address timing issues related to power noise.
Decoupling capacitance (decap) insertion is an effective way to reduce power noise. Decaps are intentionally inserted in the layout and attached to the power grid. Decap locations are important to ensure effectiveness in reducing power noise; thus, it is usually desirable to move them closer to the noisy areas.
In [4], [5], [10], and [15], decap allocation optimization is addressed at the floorplan level. In [4], the authors use iterative transient analysis and optimize decap locations. In [5], the authors formulate the decap placement as a network flow optimization problem. In [10], the authors distribute decaps proportionally to the values of currents drawn in each region. In [15], the authors observe that an effective way to allocate decaps is to distribute them to all grid nodes, assigning more decaps to grid nodes of the blocks with high switching rates. Some previous works [3], [9] propose to reduce power noise by spreading the frequently switching cells evenly across the chip to eliminate hot spots. In [3], the authors include a thermal cost function in a partition-based placer. In [9], the authors modify a quadratic placer to optimize both total power consumption and heat dissipation. Postlayout decap reallocation algorithms are proposed in [11] and [12]. Both [11] and [12] use power-noise sensitivity analysis to decide decap locations in the layout. In [12], the authors compute the sensitivity and conduct decap reallocation only once. In [11], the authors compute the sensitivity and move decaps many times for further improvement. If the power noise in a certain area is severe after the initial placement, significant decap reallocation is required. However, drastic changes of decap locations after placement should be avoided because timing, wire length, and other circuit properties might change significantly. Combining decap allocation with placement increases the number of placeable objects, which, in turn, increases the complexity of placement. The quality of decap allocation will also be seriously impacted by the early placement partition decision, which usually relies on incomplete layout information. No previous works on decap allocation have considered timing, although voltage drop may seriously impact chip timing.
In this paper, we address the decap allocation problem at the placement level. The floorplanner distributes the available decaps among the macroblocks. Our goal is to find the final locations for decaps inside the individual blocks. We propose a timing-aware power-noise reduction scheme consisting of prediction-based decap allocation and gate-sizing algorithms. The flow of our noise reduction methodology is shown in Fig. 1. First, we execute the "prediction" step. The goal of this step is to select the right amount of decap to be placed in the neighborhood of a cell. For each cell, prior to placement, we predict the size of the required decap and pad the cell accordingly (as shown in Fig. 2). The better we can predict power-noise-affected cells before placement, the fewer decap reallocations will be required after placement, and the better use we can make of the available decap area. The decap size prediction is based on the cell current consumption (CC) and the expected placed-cell neighborhood. If a cell and its placement neighbors have high CC, it is likely that this cell will suffer from excessive power noise. Predicting this cell's need for decap based only on its own switching, while ignoring its neighbors, would be less accurate. We predict a cell neighborhood based on the wire length prediction and circuit structure analysis. Mutual contraction is utilized as the wire length prediction metric [6]. Previous work on the wire length prediction will be explained in later sections. Although we focus on cell-level decap padding in this paper, our prediction-based padding method can also be applied to mixed-size or macrocell netlists.
0278-0070/$25.00 © 2007 IEEE
After the cell padding, we perform placement followed by the power grid analysis to obtain new circuit delay information. The second optimization step is "correction." We propose gate-sizing algorithms to improve power noise, power consumption, and timing after placement. Cell power noise is not only affected by placement of its neighbors but also greatly influenced by the grid design and power pad location. However, these factors are not easily predictable. We need a gate-sizing step to help us meet power noise and timing goals. Our gate-sizing algorithms also consider decap-location optimization. Because the total chip area is fixed, if a gate area is changed, the decap area will be changed accordingly. We need to consider gate sizing and decap-location optimization together.
We propose two new gate-sizing algorithms. The first algorithm linearizes the original nonlinear expressions for gate-delay calculation and uses a sequence of linear programs (SLP)-based gate-sizing approach. The optimization is done by solving a linear program (LP) in each optimization iteration. The second gate-sizing algorithm is an iterative budgeting-based heuristic. In each iteration, cell sizes are adjusted in a way that no timing violation occurs. The heuristic algorithm can achieve results close to those of the SLP method; however, the runtime is much smaller. In our gate-sizing algorithms, we do not compute noise sensitivity as in [11] and [12] because we include the grid simulation in the optimization process. The voltage drop simulation results are used to measure cell power-noise sensitivity.
The contributions of this paper are as follows. We point out that decap assignment should not be limited only to in-placement and postlayout optimization. Prelayout decap prediction can significantly improve the results. We derive gate-sizing algorithms that take into account decap allocation and timing. Experimental results show that our power-noise reduction techniques are effective. The gate-sizing formulation is also applicable to reducing power consumption and meeting timing constraints.
This paper is organized as follows. In Section II, we show the background for modeling power grid, quantifying power noise, and predicting wire length. In Section III, we discuss decap prediction and cell padding. In Section IV, we describe the gate-sizing correction process. In Section V, we show the experimental results. We conclude the paper in Section VI. The work in this paper extends [7]: we follow the same prediction-correction steps and add a new heuristic gate-sizing method.
II. BACKGROUND
In this section, we describe the models used in this paper. We also explain the mutual contraction metric, which is used for prelayout wire length prediction.
A. Modeling Power Grid, Decap, Cell Delay, and Power-Noise Measurement Metrics
The power grid can be modeled as a mesh composed of resistors, capacitors, current sources, and voltage sources, as shown in Fig. 3(a). The chip layout is divided evenly into regular blocks. One grid node corresponds to a partition block. One current source connected to a grid node models the current drawn by the cells in the corresponding block. For simplicity, the current source waveforms are modeled as triangles, as shown in Fig. 3(b). The current drawn by a current source is determined by a summation of currents drawn by those cells within its block. The average switching current of a cell is determined by its switching frequency and switching capacitance. In our experiments, the time between $t_s$ and $t_e$ [Fig. 3(b)] is set to 1 ns. During other times, the currents flowing through a cell are very small. Our modeling of the power grid and current sources is similar to that of [12].
All decoupling capacitors in a block are lumped and represented by a capacitor connected to the grid node. There are two types of decaps, namely: 1) intentionally inserted decaps and 2) background decaps from the standard cells. Standard cell decaps can be computed from cell types and sizes using information in a cell library. If a decap is inserted far away from a noisy area, it may not be helpful to ease the power noise (as explained in [12] ). Decap efficiency-degradation effects will be considered naturally in the grid simulation. If cells in a block switch more frequently, the block current drawn will be larger, the block voltage drop will increase, and the block will require more decaps to reduce power noise. A simulator will determine how serious the voltage drop is for each block.
Power pads are connected to certain grid nodes. They are modeled as voltage sources. Using flip-chip packaging, pads can be inserted at internal grid nodes. Their locations are not limited to grid periphery; they can also be inserted in the interior of the chip. For simplicity, in our model, we insert power pads uniformly on the grid. By performing transient analysis, we can calculate the voltage profile of each grid node. Because of the grid resistance, when a large current is drawn, a big voltage drop may follow, as illustrated in Fig. 3(c) . Power grid voltage drop affects chip performance. We assume that a tolerable voltage drop threshold value is known. A voltage drop lower than the threshold is considered safe, i.e., not likely to cause timing violations or system malfunction. In our experiments, the voltage margin threshold is set at 5% from the ideal voltage. A typical noise margin can be set between 5% and 10%.
When voltage drops at a power grid node, the delay of cells connected to this node changes. The cell delay-Vdd level relation can be modeled as a linear function. In Fig. 4 , we show simulation results of buffer delays for various Vdd values. We use a buffer of size 10 in 130-nm technology. Results of this experiment suggest that in the Vdd range of 1.2-2.0 V, the delay-Vdd relation is indeed close to linear. This relationship has been also observed in [2] , wherein the authors model the pin-to-pin cell delay as an inverse linear function of the supply voltage. The slope of the linear function can be characterized by simulation. We use this linear model to calculate the cell delay.
We use three metrics to measure the power noise. The first metric is the deepest voltage drop on all grid nodes. This metric tells us the magnitude of the worst voltage drop on the chip. The second metric is the number of grid nodes that have a voltage drop greater than the threshold value. This metric reveals the overall power-noise condition of the chip. We define the excess noise drop area (ENA) for a node as the size of the area between the voltage margin threshold and the voltage drop. In Fig. 3(c) , ENA is the shaded area above the voltage drop curve. The third metric is the summation of ENA for all grid nodes. This third metric complements the second metric and gives us a better picture of the chip power noise. These three metrics quantify the local and global power noise. In Section III-D, we will show the values of these three metrics.
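As an illustration of the three metrics, the following minimal sketch computes them from sampled node voltages. The input format (a dictionary of uniformly sampled waveforms), the function name `noise_metrics`, and the rectangle-rule approximation of ENA are assumptions made for this example only; the paper obtains these quantities from its own power grid simulator.

```python
from typing import Dict, Sequence, Tuple


def noise_metrics(
    waveforms: Dict[str, Sequence[float]],
    vdd: float,
    margin: float = 0.05,
    dt: float = 1.0,
) -> Tuple[float, int, float]:
    """Return (worst drop, number of violating nodes, total ENA).

    waveforms maps a grid-node name to its simulated voltage samples,
    taken at a uniform time step dt (hypothetical input format).
    """
    threshold_v = vdd * (1.0 - margin)  # lowest voltage still considered safe
    worst_drop = 0.0
    violating_nodes = 0
    total_ena = 0.0
    for samples in waveforms.values():
        # Metric 1: deepest voltage drop seen on any grid node.
        worst_drop = max(worst_drop, vdd - min(samples))
        # Metric 2: count nodes that dip below the safe level at any time.
        if min(samples) < threshold_v:
            violating_nodes += 1
        # Metric 3: ENA, approximated with a rectangle rule over the samples
        # (area between the margin threshold and the voltage curve).
        total_ena += sum(max(0.0, threshold_v - v) for v in samples) * dt
    return worst_drop, violating_nodes, total_ena
```

With a 5% margin on a 1.8-V supply, only samples below 1.71 V contribute to ENA, so a node that never crosses the threshold adds nothing to the second or third metric.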
B. Mutual-Contraction-Based Wire Length Prediction
Mutual contraction introduced in [6] is a metric to predict relative wire lengths before placement. A circuit is modeled by a graph with cells corresponding to nodes, and nets are represented as cliques with connections for each pair of nodes in a net. A weight is assigned to each connection. If a net k is connected with d(k) nodes, then every connection c in this clique is assigned a weight given by
Other connection weighting methods have been discussed in [6], but (1) produces the best results. For a pair of nodes (u, v), $w(u, v)$ is the weight of the connection between them, and $\sum_x w(u, x)$ denotes the sum of all weights on connections incident to u. The relative weight of a connection incident to u is defined as the ratio of the weight of this connection to the total weight of all connections incident to u, i.e.,

$$w_r(u, v) = \frac{w(u, v)}{\sum_x w(u, x)}$$
For a connection linking nodes x and y, the mutual contraction MC(x, y) is computed using
This measure allows us to predict the relative wire lengths of connections. Placers can be implemented using various methods and cost functions. Most placers try to minimize the total wire length. Mutual contraction is derived based on this assumption. In Fig. 5 , we show two graphs demonstrating the relationship between mutual contraction and distance among cells placed by two state-of-the-art academic placers, namely: 1) Dragon [14] and 2) FengShui [1] . The results for six Microelectronics Center of North Carolina (MCNC) benchmarks, namely: 1) bigkey; 2) apex2; 3) clma; 4) s38584; 5) frisc; and 6) ex1010, are combined and shown in Fig. 5 . First, we compute the contraction value for each connection, and then we perform placement. After placement, individual connection lengths are normalized by the chip dimension. In Fig. 5 , the x axis measures the mutual contraction values, whereas the y axis measures the normalized connection lengths. All benchmarks follow the same trend.
Both placers produce results in which the cell pairs with larger mutual contraction tend to be closer. For the cell pairs with smaller contraction, the variation of their distances is quite large.
From the placement results, we extracted the wire lengths and evaluated the correlation between the wire length and the contraction strengths. We refer to the nets with the top 30% highest contraction values as strong connections (Strong_Co). We compared the average wire length for strong connection nets (Strong_Co) to the overall average wire lengths (All_Co). For each benchmark, we normalized these lengths with respect to its chip size (half perimeter). These results are shown in Table I .
From Table I, we can see that the average strong connection length is only 4.58% of the chip dimension. However, the average connection length is about 38.06% of the chip dimension. The last column shows the wire length standard deviation for strong connections. We can see that the standard deviation for strong connections is also very small. These results show that contraction can give a good prediction of the node neighborhood. Our extensive experiments suggest that as long as a placer minimizes the total wire length, the mutual contraction as a wire length predictor is very effective.
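The mutual contraction computation and the selection of strong connections can be sketched as follows. The equation bodies are not reproduced in this text, so two points are assumptions rather than the formulas of [6]: the clique weight 2 / (d(k) (d(k) - 1)) assigned to each connection of a net with d(k) nodes, and the mutual contraction taken as the product of the two relative weights of a connection. Summing weights when two cells share several nets, and all function names, are likewise choices made for this example.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]


def connection_weights(nets: Iterable[Iterable[str]]) -> Dict[Pair, float]:
    """Expand each net into a clique and accumulate connection weights."""
    w: Dict[Pair, float] = defaultdict(float)
    for net in nets:
        cells = sorted(set(net))
        d = len(cells)
        if d < 2:
            continue
        clique_w = 2.0 / (d * (d - 1))  # ASSUMED clique weighting
        for u, v in combinations(cells, 2):
            w[(u, v)] += clique_w  # ASSUMED: weights of parallel nets add up
    return dict(w)


def mutual_contraction(nets: Iterable[Iterable[str]]) -> Dict[Pair, float]:
    """ASSUMED form: MC(x, y) = w_r(x, y) * w_r(y, x)."""
    w = connection_weights(nets)
    incident: Dict[str, float] = defaultdict(float)
    for (u, v), wt in w.items():
        incident[u] += wt
        incident[v] += wt
    return {(u, v): (wt / incident[u]) * (wt / incident[v])
            for (u, v), wt in w.items()}


def strong_connections(mc: Dict[Pair, float],
                       top_fraction: float = 0.3) -> Set[Pair]:
    """Keep the connections whose contraction is in the top fraction."""
    ranked = sorted(mc, key=mc.get, reverse=True)
    keep = max(1, round(top_fraction * len(ranked)))
    return set(ranked[:keep])
```

For a two-pin net between a and b plus a three-pin net on b, c, and d, the connection (a, b) receives the highest contraction because a has no other connections competing for its weight.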
III. PREDICTION STEP: NEIGHBORHOOD-AWARE DECAP ALLOCATION
We decide decap allocation based on noise and timing weights for each cell. If we predict that a cell may experience excessive power noise, we assign a larger noise weight to it, and consequently, we allocate more decap padding. We also estimate cell delay and interconnect delay. From the delay estimations and slacks, we compute cell timing criticality. If a cell has high timing criticality, we reduce its decap weight and decrease the allocated decap padding. Timing weights help us enforce timing constraints for the circuit.
A. Noise Weights
The likelihood that a cell might have a large amount of power noise is estimated by its average current and by the currents of its neighbors, where the neighborhood is defined in terms of layout distance. Neighborhood prediction is important because even if a cell consumes much power, it is not likely to suffer from excessive power noise when most of its neighbors are quiet. In this case, the neighboring cells act as decaps. Using the prelayout wire length estimates discussed in the previous section, we can predict the neighborhood of each cell.
Cell CC is a function of the cell's switching frequency and switching capacitance. Cell switching frequency can be estimated by feeding the circuit with input vectors and performing functional simulation. Another way to calculate the switching frequency is to calculate the switching probability. For a quick analysis, in our experiments, we use a probabilistic method as suggested in [13] .
Cell switching capacitance consists of a cell's intrinsic capacitance, input capacitance of fan-out nodes, and wire capacitance. The intrinsic and input capacitances can be obtained from the netlist. Wire capacitance is unknown before placement; thus, we use a simple statistical wire load model to predict it. The average lengths for nets of various degrees can be extracted from previous placements of similar designs. In our case, wire length statistics are averaged over all our benchmark circuits.
The expressions for computing cell CC are as follows:
where switch_freq(n) is the cell n's switching frequency, wcap(n) is the wire loading capacitance for n, fanout_cap(n) is the total input capacitance of n's fan-out cells, and ncap(n) is n's intrinsic capacitance. load_cap(n) is the total loading capacitance of n, and CC(n) is n's CC.
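The quantities defined above can be combined as in the following sketch. Taking load_cap(n) as the sum of the three capacitances follows the description in the text; taking CC(n) as the product of switching frequency and loading capacitance is an assumption, since the original expressions are not reproduced here. The lookup table `avg_len_by_degree` and the parameter `cap_per_unit_len` are hypothetical stand-ins for the statistical wire load model.

```python
from typing import Dict


def wire_cap(degree: int,
             avg_len_by_degree: Dict[int, float],
             cap_per_unit_len: float) -> float:
    """Statistical wire load: average net length for this degree times unit cap."""
    # Nets larger than the table fall back to the largest tabulated degree.
    key = degree if degree in avg_len_by_degree else max(avg_len_by_degree)
    return avg_len_by_degree[key] * cap_per_unit_len


def load_cap(wcap: float, fanout_cap: float, ncap: float) -> float:
    """Total loading capacitance: wire + fan-out input + intrinsic capacitance."""
    return wcap + fanout_cap + ncap


def cell_cc(switch_freq: float, wcap: float,
            fanout_cap: float, ncap: float) -> float:
    """ASSUMED form: CC(n) = switch_freq(n) * load_cap(n)."""
    return switch_freq * load_cap(wcap, fanout_cap, ncap)
```

Any constant factor, such as the supply voltage, would scale every cell equally and would not change the relative ranking of cells.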
Recall that the connections whose mutual contraction values are among the top 30% are classified as strong connections that are expected to be short after placement. The cells connected by strong connections are expected to be in close proximity after placement. Those connections not classified as strong are deleted from the circuit graph and thus have no impact on neighborhood CC. We define that the zeroth-level neighbor of a cell n is the cell itself. The (i + 1)th-level neighbors of n include n's ith-level neighbors and all the nodes linked by strong connections to its ith-level neighbors.
We then compute the neighborhood CC (NCC). If a cell has a high NCC, we predict its power noise to be more serious. The neighborhoods and NCCs are defined for various levels. When using a higher level neighborhood, the neighborhood size will increase; thus, more neighbors of n will be involved when computing n's NCC. The zeroth-level NCC of n is its CC. Computing cells n's ith-level NCC involves the lower CC cells in n's ith-level neighbors. The NCC function is designed such that to compute consecutive levels of NCCs, a cell needs to remember only its first neighbors. This helps us save computation time and memory. The reason why we only consider those lower CC cells in n's neighbors during the computation is that those cells may act as decaps and will bring down the CC in that area. The NCC of cells in a low-noise neighborhood will quickly decrease; however, NCC for cells in noisy area will decrease less rapidly. This helps us filter out the noisy areas.
The ith-level NCC of a node n depends on the switching of its ith-level neighbors. In the first iteration, we compute the first-level NCC of every node from the initial cell CC values. Based on the first-level cell NCC results, we compute the second-level NCC of every node. Higher level NCC can be computed using the following iteration. Let $B(n)$ denote the first-level neighbors of n. $A(n)_{i+1}$ is the set of nodes that are in $B(n)$ and have ith-level NCC not larger than $NCC(n)_i$. The (i + 1)th-level NCC of n is the average of the ith-level NCC of nodes in $A(n)_{i+1}$. The expressions for computing $A(n)_{i+1}$ and $NCC(n)_{i+1}$ are as follows:

$$A(n)_{i+1} = \{ m \in B(n) : NCC(m)_i \le NCC(n)_i \}$$

$$NCC(n)_{i+1} = \frac{1}{|A(n)_{i+1}|} \sum_{m \in A(n)_{i+1}} NCC(m)_i$$
For example, we show a small netlist in Fig. 6(a) . The edges drawn are all strong connections. The number beside each node is the CC. The cells involved in n's second-level NCC computation are shown in Fig. 6(b 1 and NCC(b) 1 are both smaller than NCC(n) 1 . As the NCC level increases and more low-CC cells are involved in the computation, the cell NCC decreases. If a cell has many low-CC neighbors, its NCC decreases rapidly. However, from this example, we can also see that although higher level NCC could involve many cells, those cells in the lower neighbors still play a significant role in the NCC computation.
The purpose of the NCC computation is to determine the noisy areas. The high-CC cells with few switching neighbors will be filtered out. Only the clustered high-CC cells will retain their high NCC.
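The iterative NCC computation can be sketched as follows, using the definitions in the text: B(n) contains n itself and the cells joined to n by strong connections, and each level averages the previous-level NCC over those members of B(n) whose value does not exceed that of n. The data layout (dictionaries keyed by cell name) and the function names are assumptions made for this example.

```python
from typing import Dict, Iterable


def next_level_ncc(ncc: Dict[str, float],
                   strong_neighbors: Dict[str, Iterable[str]]) -> Dict[str, float]:
    """Compute level i+1 NCC for every cell from the level i values."""
    nxt: Dict[str, float] = {}
    for n, value in ncc.items():
        # B(n): the cell itself plus cells linked to it by strong connections.
        b = set(strong_neighbors.get(n, ())) | {n}
        # A(n): members of B(n) whose current NCC is not larger than n's.
        a = [m for m in b if ncc[m] <= value]
        nxt[n] = sum(ncc[m] for m in a) / len(a)
    return nxt


def ncc_at_level(cc: Dict[str, float],
                 strong_neighbors: Dict[str, Iterable[str]],
                 level: int) -> Dict[str, float]:
    """Level 0 is the cell CC; each iteration needs only first-level neighbors."""
    ncc = dict(cc)
    for _ in range(level):
        ncc = next_level_ncc(ncc, strong_neighbors)
    return ncc
```

Because n always belongs to its own set A(n), the average is never taken over an empty set. A high-CC cell next to a quiet cell loses NCC after one level, whereas two adjacent high-CC cells keep their values, which is the filtering behavior described above.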
The noise weight for a cell is computed from the normalized cell NCC. $le$ is the neighborhood level used to compute NCC. Suppose that $max\_NCC_{le}$ is the maximum $le$th-level NCC over all the cells. The normalized cell NCC for n is computed by dividing $NCC(n)_{le}$ by $max\_NCC_{le}$. nw(n) is the noise weight for a cell n, i.e.,
In Section V, we experiment using different values of $le$ and observe their impact on the distribution of noise weighting. We find that setting $le = 4$ leads to the best noise weight distribution.
B. Timing Weights
Besides considering the power-noise factor, we also need to account for the timing factor. If cells are timing critical, we do not add large decaps to them. Adding a large decap padding area to cells on a critical path may increase the distances between the cells and consequently increase the interconnect delay. The criticality of a cell is computed using its slack. slack(n) denotes the slack of a node n. The slack for each cell can be computed from its input signal arrival and required times. max_slack is the maximum slack of all the nodes. slack(n) is normalized by max_slack. tw_exp is the timing weight exponent. crit(n) is the criticality of a node. tw(n) is the timing weight of a node n. If a node has a smaller slack, it is more timing critical, and its crit(n) will be higher. The timing criticality of a node is computed as follows:
With bigger tw_exp, the criticality difference between the highly critical and noncritical cells will be larger. The same criticality function has been used in [8] . Based on their experiments, we set tw_exp to 4. The timing weight function is
C. Decap Allocation
The decap area weight function is the summation of the noise and timing weights. decap_weight(n) is the decap weight for a node n. T_S is the timing cost scale. tw(n) and nw(n) are both normalized to a value from 0 to 1. Setting T_S to a higher value will increase the timing weight and assign less decap to a timing-critical area. The decap weight function is
In Section V, we will evaluate the impact of setting T_S to different values. We allocate the decap area according to the node's decap weights. Since we use the standard cell flow, the cell height and decap height are both fixed. The total decap width (TDW) is computed by multiplying the total cell width (TCW) by the decap ratio (DR); thus, TDW = TCW × DR. The default value for DR is 0.2. A bigger DR may reduce power noise more, but at a cost of increased chip area, increased power consumption, or degraded chip timing. The portion of decap allocated to a cell n is the ratio of the decap weight of n to the summation of decap weights for all nodes. d_width(n) is the decap width of a node n. The decap width function is

$$d\_width(n) = TDW \times \frac{decap\_weight(n)}{\sum_m decap\_weight(m)}$$
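The complete prediction-step allocation can be sketched as follows. Only the final proportional split of TDW is fully specified by the text. The other forms are assumptions for illustration: the noise weight is taken equal to the normalized NCC, the criticality is taken as (1 - slack / max_slack) raised to tw_exp, the timing weight as 1 - crit so that critical cells receive less padding, and the decap weight as nw + T_S × tw. The clamping of criticality to the range 0 to 1 is also an addition of this example.

```python
from typing import Dict


def noise_weights(ncc: Dict[str, float]) -> Dict[str, float]:
    """ASSUMED: nw(n) equals NCC(n) normalized by the maximum NCC."""
    max_ncc = max(ncc.values())
    return {n: (v / max_ncc if max_ncc > 0 else 0.0) for n, v in ncc.items()}


def timing_weights(slack: Dict[str, float], tw_exp: float = 4.0) -> Dict[str, float]:
    """ASSUMED: crit = (1 - slack/max_slack)^tw_exp and tw = 1 - crit."""
    max_slack = max(slack.values())
    tw: Dict[str, float] = {}
    for n, s in slack.items():
        ratio = s / max_slack if max_slack > 0 else 0.0
        crit = min(1.0, max(0.0, 1.0 - ratio)) ** tw_exp  # clamp to [0, 1]
        tw[n] = 1.0 - crit  # critical cells get a small timing weight
    return tw


def decap_widths(nw: Dict[str, float], tw: Dict[str, float],
                 total_cell_width: float, dr: float = 0.2,
                 t_s: float = 1.0) -> Dict[str, float]:
    """Split TDW = TCW * DR among cells in proportion to their decap weights."""
    weight = {n: nw[n] + t_s * tw[n] for n in nw}  # ASSUMED weight form
    tdw = total_cell_width * dr
    total = sum(weight.values())
    if total == 0:
        return {n: tdw / len(weight) for n in weight}  # uniform fallback
    return {n: tdw * w / total for n, w in weight.items()}
```

Whatever forms the weights take, the allocated widths always sum to TDW, so a larger T_S only shifts padding away from timing-critical cells without changing the total decap area.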
D. Experiments
In this section, we demonstrate the results of our neighborhood-aware decap allocation algorithm. We use the benchmark circuit ex1010 in this demonstration. The quantitative results for all benchmarks will be shown in Section V.
First, we perform the neighborhood prediction and compute various NCC levels. Then, we do placement using Dragon. In Fig. 7, we show the cell NCC distribution for various NCC levels. In this example, we set T_S = 0, thus showing only the effect of power noise. In Fig. 7(a), we show the top 55% current-consuming cells with neighborhood level 0. Neighborhood level 0 means that in computing a cell NCC, only the current consumed by this cell is accounted for. We observe that in Fig. 7(a), the upper-left and lower-left areas are very dense and could be power noisy. Other areas also have numerous highly switching cells. In Fig. 7(b), we show the cell NCC distribution considering its first-level neighborhoods. We use the minimum NCC of those cells shown in Fig. 7(a) as the threshold value, thus showing in Fig. 7(b)-(d) only those nodes with NCC greater than the threshold value. In Fig. 7(b), we can see that the cells in the right and center areas become sparser, which means that the number of high-NCC cells decreases in those areas. However, the power-noisy areas in the upper-left/lower-left corners are still dense and become more visible. Fig. 7(c) and (d) shows the results with neighborhood levels 2 and 4, respectively. As the NCC level increases, the sparse area becomes even sparser. The number of high-NCC cells keeps decreasing in those areas. This experiment shows that the iterative NCC computation scheme is effective for isolating the noisy areas. The high-CC cells with low-CC neighbors are filtered out. More decaps can be allocated to those expected noisy areas to reduce power noise.
The power grid simulation result is shown in Fig. 8 for the case where the NCC level is equal to 4. The power voltage is 1.8 V. The grid granularity is 20 × 20. The unit in the x and y dimensions is millimeters. There are several power pads in the middle and on the periphery of the grid. We assume that the chip switching frequency is 100 MHz. Grid node decaps and current profiles are determined as described in Section II-A. The worst grid voltage drop is recorded for each grid node. Fig. 8(a) shows the results when decaps are distributed uniformly for all cells and our weighting technique has not been applied. Fig. 8(b) shows the results of decaps distributed according to our prediction-based weighting method. We use the same cell placement for Fig. 8(a) and (b) ; thus, it is easier to compare the difference in power noise. We observe that in Fig. 8(a) and (b) , the biggest voltage drop occurs at the upper-left and lower-left parts of the chip, which is as predicted in Fig. 7(d) . The lowest grid node voltage is 1.66 V in Fig. 8 (a) and 1.72 V in Fig. 8(b) . The results show that our prediction-based decap weighting method is effective in reducing the power noise. Fig. 8(b) uses the same placement as Fig. 8(a) ; thus, this placement contains overlaps. Fig. 8(c) shows the result after running a new placement. This placement is legalized, and the power-noise reduction is similar to that in Fig. 8(b) . In this section, we show only part of the experimental results, and more results will be shown in Section V.
IV. CORRECTING STEP: GATE SIZING FOR POWER NOISE AND TIMING
After assigning decap padding to cells, we carry out placement and power grid analysis. There are several placers that attempt to spread out highly switching cells across the chip [3], [9]. In our experiments, we use the publicly available academic placer Dragon [14], which does not have the capability of spreading the frequently switching cells. After placement, long interconnect delays can be reduced by buffering or gate sizing to meet timing constraints and further reduce power noise. In this section, we describe gate-sizing algorithms to optimize power noise and timing. The first algorithm is based on an SLP, whereas the second algorithm uses a budgeting-based heuristic. Both gate-sizing algorithms take into account power-noise optimization.
A. Correcting Step: An SLP Optimization
The first algorithm uses an SLP technique, which solves an LP in each iteration. In each iteration, the coefficients of the LP are updated, and a new LP is derived for the next iteration. In each LP formulation, three types of constraints are considered, namely: 1) timing; 2) area; and 3) power noise. The objective of each LP is to minimize the total power consumption and reduce the power noise. We will discuss each type of constraint separately.
1) Timing Constraints:
The circuit is modeled as a graph G. Nodes in the graph correspond to the cells, and edges represent the source-sink relationships in the circuit. Note that the graph model employed here uses edges rather than connections as in Section II-B.
We model cell delay using a gain-based model. $g_u$ is the fan-in arrival time of u, and $d_u$ is the node delay of u. The timing constraints are as follows:
When calculating node delays, we include the IR drop (IRD) effect on delay and loading capacitance. $V_{chip}$ is the ideal power supply voltage. $V^{pn}_u$ is the actual supply voltage after power grid simulation. $M_u$ is the intrinsic cell delay. $L_u$ is the delay slope per unit of loading capacitance. $S_u$ is the gate size of a cell u. $FO(u)$ is the set of fan-out nodes for a cell u. $\mu_w$ is the size-1 input capacitance for a cell w. $W(u)$ is the wire capacitance loading for a cell u. The node delay function is computed according to
The part of the equation in brackets is contributed by the traditional gain-based delay model (21). $(V_{chip} / V^{pn}_u)$ is the delay scaling from supply voltage. In general, $L_u$ is not constant, but the load dependence of the delay can be assumed linear in the neighborhood of the size $S_u$.
The nonlinear timing constraints such as those in (14) cannot be used directly in an LP formulation. We apply the first-order Taylor's expansion to transform (14) into a linear equation.
We calculate the derivatives of a node delay with respect to the gate size for all the fan-out nodes and the node itself. These derivatives, including $\partial d_u / \partial S_w$, can be computed using the following equations:
Since the linear approximation of (14) by (15) is effective only for the gate sizes close to the initial values, we add the gate-size change boundary constraints. $S^L_i$ and $S^U_i$ are the lower and upper bounds for the new gate sizes. GS_SCALE is the gate-scale limit allowed in each iteration. The gate-size boundary constraint is stated in (19), whereas the upper and lower gate-size bounds can be calculated using (20), i.e.,
If we select a GS_SCALE that is too small, the number of SLP iterations will be large before the optimization converges. If we select a GS_SCALE that is too large, the convergence will be very difficult. We perform several experiments to select the scaling value that can lead to efficient convergence. We use GS_SCALE = 1.2 as the default. The gate size determined by the sizing algorithm should be in the range provided by the cell library.
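The linearization used in each SLP iteration can be illustrated with the following sketch. The delay expression is an assumption reconstructed from the symbol definitions: the supply scaling factor multiplied by a gain-based term, intrinsic delay plus slope times load divided by gate size, where the load is the sized input capacitance of the fan-out cells plus the wire capacitance. The bounds S / GS_SCALE and S × GS_SCALE, clipped to a hypothetical library range, are likewise assumed forms.

```python
from typing import Dict, Tuple


def load(fanout_sizes: Dict[str, float], mu: Dict[str, float],
         wire_cap: float) -> float:
    """Loading capacitance: sized fan-out input caps plus wire cap."""
    return sum(mu[w] * s for w, s in fanout_sizes.items()) + wire_cap


def node_delay(s_u, fanout_sizes, mu, wire_cap, m_u, l_u, v_chip, v_pn) -> float:
    """ASSUMED form: (V_chip / V_pn) * [M_u + L_u * load / S_u]."""
    return (v_chip / v_pn) * (m_u + l_u * load(fanout_sizes, mu, wire_cap) / s_u)


def delay_gradient(s_u, fanout_sizes, mu, wire_cap, l_u, v_chip,
                   v_pn) -> Tuple[float, Dict[str, float]]:
    """Partial derivatives with respect to S_u and each fan-out size S_w."""
    scale = v_chip / v_pn
    d_su = -scale * l_u * load(fanout_sizes, mu, wire_cap) / (s_u * s_u)
    d_sw = {w: scale * l_u * mu[w] / s_u for w in fanout_sizes}
    return d_su, d_sw


def linearized_delay(new_s_u, new_fanout_sizes, s_u, fanout_sizes, mu,
                     wire_cap, m_u, l_u, v_chip, v_pn) -> float:
    """First-order Taylor expansion of the delay around the current sizes."""
    d0 = node_delay(s_u, fanout_sizes, mu, wire_cap, m_u, l_u, v_chip, v_pn)
    d_su, d_sw = delay_gradient(s_u, fanout_sizes, mu, wire_cap, l_u, v_chip, v_pn)
    return d0 + d_su * (new_s_u - s_u) + sum(
        d_sw[w] * (new_fanout_sizes[w] - fanout_sizes[w]) for w in fanout_sizes)


def size_bounds(s: float, gs_scale: float = 1.2, lib_min: float = 1.0,
                lib_max: float = 16.0) -> Tuple[float, float]:
    """ASSUMED bounds: [S / GS_SCALE, S * GS_SCALE], clipped to the library."""
    return max(lib_min, s / gs_scale), min(lib_max, s * gs_scale)
```

Within the bounded range the linear estimate stays close to the nonlinear delay, which is why the size change per iteration is limited.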
2) Area Constraints:
In the gate-sizing optimization, we add constraints to guarantee that the summation of the gate and decap areas does not change after the optimization, so that the chip area remains the same. We try to avoid a large decap reallocation. Large decap area reallocation might cause displacement of a large number of cells, which, in turn, could affect design convergence. Our idea is to divide the chip area into several equal-sized blocks. The summation of gate area and decap area in each block stays the same during the sizing optimization. $B$ is the set of all blocks. $\beta_u$ is the cell area increase ratio when its size increases by 1. $C_u$ is the decap padding area for the cell u. $K_i$ is the summation of the cell and decap areas in a block after the first placement. The area constraints are stated as follows:
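The per-block area bookkeeping can be sketched as follows. The example assumes that the area of a cell is proportional to its size with ratio beta, so that the area of a block is the sum of beta × size plus decap padding over its cells; the dictionary layout and function names are hypothetical.

```python
from typing import Dict, List


def block_totals(cells_in_block: Dict[str, List[str]],
                 sizes: Dict[str, float],
                 beta: Dict[str, float],
                 decap: Dict[str, float]) -> Dict[str, float]:
    """K_i: cell area plus decap padding area of each block after placement."""
    return {blk: sum(beta[u] * sizes[u] + decap[u] for u in members)
            for blk, members in cells_in_block.items()}


def decap_budget_after_sizing(k: Dict[str, float],
                              cells_in_block: Dict[str, List[str]],
                              new_sizes: Dict[str, float],
                              beta: Dict[str, float]) -> Dict[str, float]:
    """Decap area left in each block once the resized cells are accounted for.

    Keeping K_i fixed means any area freed by downsizing a gate becomes
    decap area in the same block, and upsizing consumes decap area.
    """
    return {blk: k[blk] - sum(beta[u] * new_sizes[u] for u in members)
            for blk, members in cells_in_block.items()}
```

A negative budget for a block would indicate that the proposed sizes do not fit in the fixed block area.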
3) Power-Noise Constraints: The effectiveness of a decap in reducing power noise depends on its size and its distance from the power-noisy area. We need a sufficient amount of decap in the power-noisy area to reduce the noise. To handle the power-noise constraints, we divide the chip area into several equal-sized blocks. The power-noise constraints guarantee that the summation of decaps in a block is greater than the summation of switching currents of all the gates in the block multiplied by a scalar value for power-noise improvement.
Suppose that $\gamma_u \times S_u$ is the average current drawn by a cell u. $\gamma_u$ can be computed using the cell switching frequency and loading capacitance. $Z_m$ is the largest ratio of block decap over block CC among all the blocks in the current solution. PN_IMP is an improvement factor for power noise. Z is the lower bound on the ratio of block decap to block current drawn in the optimization. Z is computed as the product of $Z_m$ and PN_IMP. The power-noise constraint is stated in (22), and the formula for Z is shown in (23), i.e.,
We set the default value of PN_IMP to 1.2, which means that the expected improvement of block decap over block CC is 20%. PN_IMP can be set higher for more improvement.
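The computation of the target ratio Z can be sketched as follows. Since (22) and (23) are not reproduced here, the sketch follows the prose only: Z is the largest current decap-to-current ratio times PN_IMP, and each block needs at least Z times its current as decap. The function names and the numbers are hypothetical.

```python
PN_IMP = 1.2  # default improvement factor (value from the text)

def block_current(cells):
    """Average current of a block: sum of gamma_u * S_u over its cells."""
    return sum(gamma * size for gamma, size in cells)

def noise_target(blocks, pn_imp=PN_IMP):
    """Z = Z_m * PN_IMP, with Z_m the largest decap/current ratio.

    blocks: list of (decap, current) pairs, one per block.
    """
    z_m = max(decap / current for decap, current in blocks if current > 0)
    return z_m * pn_imp

def decap_shortfall(blocks, z):
    """Extra decap each block needs to satisfy decap >= z * current."""
    return [max(0.0, z * current - decap) for decap, current in blocks]

blocks = [(2.0, block_current([(0.5, 4.0), (1.0, 2.0)])), (3.0, 3.0)]
z = noise_target(blocks)
short = decap_shortfall(blocks, z)
```

With PN_IMP greater than 1, even the best block starts below the target, so the optimization has to grow decap (or shrink gates) in every block that draws current.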
4) Gate-Sizing Formulation:
Constraints for the gate-sizing formulation include timing, power noise, and area. The optimization objective consists of two parts, namely: 1) the total power consumption and 2) the weighted total decap area summation. $pc(S_u)$ is the power consumption for a cell u.
$V_{chip} - V^{pn}_u$ is the voltage drop experienced by a cell u. Those cells whose voltage $V^{pn}_u$ differs from $V_{chip}$ will be assigned more decap. BAL is the balancing factor, which is computed as follows:
BAL is also a normalizing factor between the power and noise cost function, enabling them to be compared appropriately. NCOF is the noise weighting in the objective function. Its default value is 5 because we put greater effort on optimizing power noise. When NCOF increases, more optimization effort will be put on reducing power noise. The complete gate-sizing formulation, i.e., the gate-sizing optimization for timing and power noise, is as follows:
Subject to:
The gate-sizing objective function is shown in (25). Equations (26) and (27) capture the timing constraints. Equations (28) and (29) state the area constraints and the power-noise constraints, respectively. For $C_u$ in (25), we use the weighting function $(V_{chip} - V^{pn}_u)^2$ because we want to assign decaps closer to the cells with larger voltage drops. We will discuss this problem further in Section V.
After setting up the initial LP formulation and solving it, we obtain a new gate-size configuration that improves the LP objective function. Using the new solution, we update the coefficients of the linear equations and solve the LP problem again. We continue this iteration until the optimization converges, that is, until the improvement becomes insignificant. In our implementation, the SLP optimization is stopped if the total decap area increment in the current iteration is less than 10% of the decap area increment in the previous iteration. In the experiments, we will evaluate the improvement gained when applying different numbers of iterations.
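The stopping rule of the SLP loop can be sketched as follows. The LP itself is not modeled: `solve_lp_step` is a hypothetical callback that relinearizes the constraints, solves one LP, and returns the resulting total decap area. The iteration cap `max_iters` is also an assumption.

```python
def run_slp(solve_lp_step, initial_decap_area, max_iters=100, ratio=0.10):
    """Repeat LP solves until the decap-area gain becomes insignificant.

    solve_lp_step: hypothetical callback taking the iteration index and
                   returning the total decap area after that LP solve.
    Stops when the gain of the current iteration is below `ratio` times
    the gain of the previous iteration (10% in the text).
    Returns the list of decap areas, starting with the initial one.
    """
    areas = [initial_decap_area]
    prev_gain = None
    for it in range(max_iters):
        area = solve_lp_step(it)
        gain = area - areas[-1]
        areas.append(area)
        if prev_gain is not None and gain < ratio * prev_gain:
            break
        prev_gain = gain
    return areas

# Toy trace of decap areas standing in for real LP solves.
toy_areas = [110.0, 115.0, 117.0, 117.1, 117.15]
history = run_slp(lambda it: toy_areas[it], initial_decap_area=100.0)
```

In the toy trace, the gains are 10, 5, 2, and about 0.1; the last one is below 10% of the previous gain, so the loop stops after four LP solves.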
B. Correcting Step: Budgeting-Based Heuristic
The SLP-based gate-sizing algorithm can produce very high quality results if we continue the iteration. Although SLP is efficient, the runtime might still be too high for big circuits. In this section, we propose a heuristic gate-sizing algorithm that takes timing, power noise, and CC into account and that can achieve good results in a short time. In the following paragraphs, we discuss the case in which the critical path timing constraint is larger than the current critical path delay (CritP). In this case, we need only to downsize the gates. For the case in which the current CritP exceeds the path delay constraint, we can first uniformly increase the size of every gate until the path delay constraint is satisfied. Next, our gate-sizing heuristic can be applied to reduce the gate sizes.
The gate-sizing heuristic is based on an iterative scheme. In each iteration, we slightly resize the gates according to weights assigned to them. We first compute the timing, power noise, and CC weight for each cell. The power-noise weight pn(n) and the sizing weight sizing_wgt(n) are computed as follows:
where ic(n) is the CC of n, and the timing weight tw(n) is from (10). We define a cell's gate level as the maximum number of gate levels over all paths from primary inputs or flip-flops to this cell. To guarantee that the resized gates will not cause timing violations, we resize the gates level by level, following the reverse gate-level order. As we resize the gates, the new cell required time is updated. We make sure that the increase in delay does not exceed the cell's original slack. For example, as shown in Fig. 9, cell a is at gate level (i − 1), and cells b and c are at level i. The original arrival and required times for a are 5 and 6, respectively, so the original slack of a is 1. If we reduce the size of a, its delay will increase, whereas its required time will decrease. The maximum delay increase for a is equal to its slack, which is 1. In our program, we define the cell delay increment budget rbgt(n) as the minimum of the cell slack and the cell sizing weight multiplied by a cell delay increment unit INU, i.e.,
If we assign INU to a large value, cell-required-time will decrease quickly in the first few reverse levels, and only cells in those levels will be resized. However, if we assign INU to a value that is too small, we will need many resizing iterations to finish the optimization. From our experiments, we observe that setting INU = 0.1 ns (which is a value of about the same order as the cell intrinsic delay) can strike a good balance between the runtime and quality.
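The level ordering and the delay budget can be sketched as follows. The budget is read here as min(slack, sizing_wgt × INU), which is one interpretation of the prose definition. The netlist representation, the function names, and the sizing weight value are hypothetical.

```python
from collections import deque

INU = 0.1  # cell delay increment unit in ns (value from the text)

def gate_levels(fanin):
    """Longest-path level of each cell.

    fanin: dict mapping a cell to the list of cells driving it; cells
           driven only by primary inputs or flip-flops have an empty list.
    """
    fanout = {c: [] for c in fanin}
    pending = {c: len(drivers) for c, drivers in fanin.items()}
    for cell, drivers in fanin.items():
        for d in drivers:
            fanout[d].append(cell)
    levels = {c: 1 for c in fanin}
    queue = deque(c for c, n in pending.items() if n == 0)
    while queue:
        cell = queue.popleft()
        for succ in fanout[cell]:
            levels[succ] = max(levels[succ], levels[cell] + 1)
            pending[succ] -= 1
            if pending[succ] == 0:
                queue.append(succ)
    return levels

def delay_budget(slack, sizing_wgt, inu=INU):
    """rbgt(n), read as min(slack, sizing_wgt * INU), never negative."""
    return max(0.0, min(slack, sizing_wgt * inu))

# Cell a drives cells b and c, as in the example with slack(a) = 1.
levels = gate_levels({"a": [], "b": ["a"], "c": ["a"]})
reverse_order = sorted(levels, key=levels.get, reverse=True)
budget_a = delay_budget(slack=1.0, sizing_wgt=3.0)
```

Cells b and c are visited before a, so any reduction in the required time of a caused by resizing b and c is known before the budget of a is fixed.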
After the node sizing weight is computed, we update the cell delay and arrival times. Then, we check to see if there is room for gate-sizing optimization. This is done by noting whether the reduction of a total slack in this iteration is greater than 10% of the slack reduction of the previous iteration. If the criterion for improvement is satisfied, we will continue the optimization; otherwise, the algorithm stops.
After the optimization, many cells may have smaller sizes. We increase and relocate decaps in each partition area according to the updated CC, i.e., ic(n). The partitions are as described in Sections IV-A2 and IV-A3. The reason for relocating decaps only within a partition is to reduce the circuit performance disturbance.
The flow of the heuristic gate-sizing algorithm is shown in Fig. 10.
V. EXPERIMENTS
We conduct our experiments using 0.18-µm technology. Several medium- and large-sized benchmark circuits are selected from the MCNC benchmark suite. Columns 1 and 2 in Table II show the circuit information. The benchmark circuits have sizes ranging from 4199 to 23 362 cells. Columns 3 and 4 in Table II show the number of grid nodes and power pads for each circuit, respectively. TCW denotes the summation of all cell widths. For each benchmark, the available total decap width (TDW) is given as a percentage of the TCW. We will experiment with varying total decap percentages. Since we assume a standard cell design style, the heights of the cells and decaps are the same. The sum of the decap and cell areas defines the total chip area. Circuits are placed using the fixed-die mode in Dragon. The default chip voltage is 1.8 V, and the voltage margin threshold is 5% of the ideal voltage. The experiments are run on a Linux Intel 2.4-GHz machine.
Fig. 11 shows the experimental flow. We first run the SIS [16] technology mapper with timing performance as the optimization objective. SIS also does gate sizing during synthesis. Based on the netlist characteristics of the input circuits, we perform the decap allocation prediction using the algorithm discussed in Section III. Afterward, we change the cell widths to include decap padding and perform the placement. We do not need to modify the placer to take decap allocation into account. After placement, we update wire capacitance and gate delay and then perform the power grid analysis. Next, we determine voltage drops for all cells, update cell delays according to the new grid voltage, and do timing analysis with the new node delays. These are the results after the prediction step, which form the input for the gate sizing. After the sizing optimization, we again perform the grid analysis. Cell delays are also updated to reflect the new grid voltages, and then we perform timing analysis. These are the results after power-noise correction.
A. Prediction Scheme Evaluation
To evaluate the decap prediction methods, we conduct experiments applying various strategies. First, we allocate no decap to cells (NOC). Second, we distribute decaps evenly to all cells (EVEN). Third, we perform the prediction-based decap allocation (WGT) ignoring timing cost (T_S = 0). Fourth, we perform prediction-based allocation including timing cost (T_S = 1). When T_S = 1, the decap allocation considers noise and timing weights as equally important. A graphic illustration of the different strategies is shown in Fig. 12.
The experimental results are shown in Table III. T_S is the timing cost scale in (11). DD denotes the decap allocation method (NOC, EVEN, or WGT). From Table III, we can see that for T_S = 0, the power noise, timing, and total slack results all improve when DD changes from NOC to EVEN and then to WGT. Comparing the cases of EVEN and WGT, the IRD decreases by 27%, the SENA decreases by 51%, and the vioC decreases by 28%. This shows that our prediction-based decap allocation method is effective and that decaps are useful in reducing power noise. The timing also improves because the voltage drop decreases and node delays become shorter. When we increase the timing weight by changing T_S from 0 to 1 using prediction weighting (WGT), timing results improve by 4%; however, power-noise results become worse. The timing improvement from increasing the timing scale T_S is only minor.
Table IV shows the total wire length comparison for four cases, namely NOC, EVEN, (T_S = 0, WGT), and (T_S = 1, WGT). The last row shows the normalized wire lengths. For each benchmark, the chip size is the same for all experiments. The wire length in NOC is smaller than in the other experiments because, in the absence of decaps, cells can be placed closer together. For the other three experiments, the total wire length results are similar.
B. DR Effect Evaluation
In the experiment reported in Table III, we use DR = 0.2 of the total cell area. It is interesting to observe how the DR affects power noise. We conduct additional experiments using DR = 0.1 and 0.3. We obtain the placement from the experiments in Table III, scaling the chip width accordingly to scale the decap area. The numbers of rows and columns in the power grid do not change. We experiment with the case T_S = 0 and the decap allocation methods EVEN and WGT. The normalized average results from all benchmarks are shown in Table V. Those results are normalized to the case T_S = 0 and the EVEN DD in Table III. From the results, we can see that as the DR increases, the power-noise results improve, although the timing results degrade slightly. Comparing DR = 0.3 and 0.1 at DD = WGT, the IRD reduces by 23%, the SENA improves by 91%, the vioC improves by 65%, and the timing degrades by 2.8%.
C. NCC Level Effect Evaluation
The power-noise results also depend on the NCC level and on how the neighbors of a cell are predicted. If too few NCC levels are used, many decaps will be allocated to cells that have a high CC but a small neighborhood CC. However, since such cells are unlikely to suffer from a power-noise problem, they should not be allocated decaps. If the NCC level is too large, a neighborhood will cover too much chip area and will lose its meaning. According to (6) and (7), the NCC computation depends strongly on a cell's neighbors at a particular level. The effect of remote neighbors on a cell's NCC is small. When the NCC level exceeds a certain value, faraway neighbors will not have a significant impact on a cell's NCC. In the first four rows of Table VI, we show the results when using varying NCC levels. The results for all the benchmarks are averaged and normalized with respect to the case of the NCC level being equal to 4. We show the results for NCC levels 0, 1, 4, and 8. From the results, we can see that NCC level 4 gives the best results. When the NCC level is too small or too large, the power-noise results become worse.
As stated in Section III-A, we consider those connections whose mutual contraction comes within the top 30% to be strong. We computed cell neighborhoods based on those strong connections. We also experimented with differently defined neighborhoods. For example, instead of using only the strong connections to determine neighborhoods, we used all the connections. As long as there was a connection between a pair of nodes, we considered them to be first-level neighbors. In this case, cell NCCs could be influenced by cells that have been placed far away. In Table VI, we refer to the case of using only strong connections as Strong_Co. The case of using all connections to find neighbors is referred to as All_Co. From the results, NCC level 4 with neighborhoods defined by strong connections gives much better results than NCC level 4 with neighborhoods defined by all connections. When using strong connections, the IRD improves by 30%, the SENA improves by 3.1 times, and the vioC improves by 1.7 times. Using too few NCC levels may not cover enough neighbors and may not capture the neighborhood effect on power noise. However, using too many NCC levels may cover too much area, which, in turn, may increase the estimation errors. Therefore, choosing a good neighborhood size is important.
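The two neighborhood definitions can be sketched as follows. The sketch only extracts neighbors up to a given NCC level; it does not compute the NCC of (6) and (7). The edge representation, the function names, and the mutual contraction values are hypothetical.

```python
from collections import deque

def strong_edges(edges, top_fraction=0.30):
    """Keep the connections whose mutual contraction is in the top 30%.

    edges: list of (u, v, mutual_contraction) tuples.
    """
    ranked = sorted(edges, key=lambda e: e[2], reverse=True)
    keep = max(1, round(top_fraction * len(ranked))) if ranked else 0
    return ranked[:keep]

def neighbors_by_level(edges, start, max_level):
    """Breadth-first search returning {neighbor: level} up to max_level."""
    adj = {}
    for u, v, _ in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    level = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if level[node] == max_level:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in level:
                level[nxt] = level[node] + 1
                queue.append(nxt)
    del level[start]
    return level

edges = [("a", "b", 0.9), ("b", "c", 0.8), ("c", "d", 0.7), ("a", "e", 0.1),
         ("a", "f", 0.1), ("e", "g", 0.1), ("f", "h", 0.1), ("g", "h", 0.1),
         ("d", "h", 0.1), ("b", "g", 0.1)]
strong_co = neighbors_by_level(strong_edges(edges), "a", max_level=2)
all_co = neighbors_by_level(edges, "a", max_level=2)
```

In this toy netlist, the strong-connection neighborhood of cell a contains two cells within two levels, whereas the all-connection neighborhood already reaches six, which illustrates how weak connections can pull in cells that may be placed far away.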
D. Grid Design Effect Evaluation
The power grid design also has a big impact on power noise. In this experiment, we show noise results for various granularity grid and pad designs. Gnode# denotes the number of grid mesh nodes. PAD# denotes the number of power pads. ×1 refers to using the original design. ×4 stands for increased node or pad count by four times. For this experiment, the total grid area is fixed. If we use twice the number of the vertical and horizontal grid lines, the number of grid nodes increases by four times. The width of the power grid lines will be reduced by half, and their 
E. Results After Power-Noise Correction
Table VIII shows the experimental results after power-noise correction. The second column shows various types of optimization. si0 denotes the experiment in which T_S = 0, DD = WGT for weighting prediction, and there is no SLP gate-sizing optimization after placement. si2 and si4 denote the experiments in which we run two and four SLP gate-sizing iterations after placement, respectively. si* denotes the case in which we repeat SLP iterations until the stop criterion is met. hur denotes the results for the heuristic sizing algorithm.
TCC is the total chip CC. RT is the runtime for the SLP and hur optimization. The last five rows are the normalized average results for all the circuits with respect to the case si0.
We observed that as we conducted more SLP optimization iterations, the results showed improvements in power noise, timing, and chip power consumption. The decap width increased substantially. Note that the summation of the decap width and cell width remains fixed: the TCW decreases by the same amount as the TDW increases. Comparing si0 with si*, voltage drop improves by 43%, SENA becomes almost 0, decap area increases by 3.4 times, and CC improves by 43%. The results from the heuristic sizing algorithm are close to si* and require much less runtime. The heuristic iteration number ranges from 18 (for bigkey) to 71 (for clma). The average si* iteration number is 15.
F. Voltage Drop Profile
Fig. 13 shows the voltage profiles for different optimization schemes at a grid node for the benchmark bigkey. In this experiment, T_S is set to 0; thus, the optimization targets only the power-noise reduction. WGT-si2 denotes using prediction-based weighting (WGT) and two iterations of SLP. si0 denotes the case with no SLP optimization. We can see that the voltage drop decreases as we do more SLP optimization and use our prediction-based weighting scheme.
G. Gate-Sizing Optimization Function Evaluation
To justify the use of the quadratic weight function $(V_{chip} - V^{pn}_u)^2$ for $C_u$ in (25), we compare it with weight functions of other powers. If we use a linear function, many instances will have similar power-noise weights, and the effect of decap insertion will be worse than with a quadratic function. With a higher power function, the differences in power-noise weighting might be too large, and decap insertion may be too greedy. Based on experiments, we observe that the quadratic weighting function yields the best voltage drop distribution.
VI. CONCLUSION
In this paper, we addressed the power-noise problem considering timing constraints. Decap padding was added to each cell. We proposed a decap allocation flow, which consists of prediction and correction steps. First, we allocated decap to each cell based on predictions. Decaps were allocated to cells that were most likely to have large voltage drops. We also considered timing criticality in decap allocation, so that the added decap area would not increase the CritP. A possible extension to improve timing is to include timing weights in the neighborhood decap computation. Currently, our neighborhood-based decap allocation method only considers power weighting. For a cell in the critical path, its neighbors should have larger decap padding to reduce regional voltage drop. The delay for that critical cell can be decreased if its voltage drop is reduced.
In the decap correction step, we performed gate sizing and reallocated decaps based on more complete information available after placement and power grid analysis. For gate-sizing optimization, we proposed two algorithms. The first algorithm is based on the SLP method, and the second is a heuristic.
Both gate-sizing algorithms assume continuous sizing. In practice, however, continuous sizing may not be possible, and our algorithms need to be adapted to discrete gate sizing. Our gate-sizing algorithms also require interaction with grid simulation. After the gate sizing is done, we perform power grid simulation again and determine the new voltage drop profile. If necessary, gate sizing may be repeated. We think that iterating between gate sizing and simulation is more practical than trying to solve both problems together because of the computational complexity.
Comparing the results achieved for uniformly assigned decaps against decaps assigned using our prediction-based weighting method, our method decreases the maximum voltage drop by 27%, the SENA by 51%, and the grid voltage violations by 28%. We found that changing the timing weight from 0 to 1 improved timing by 4%. The power-noise results also improved when the DR was increased. Comparing DR = 0.3 and 0.1 using prediction-based weighting (WGT), the IRD reduces by 23%, the total excess noise improves by 91%, the vioC improves by 65%, and the timing degrades by 2.8%.
Cell power noise was also affected by the number of neighborhood levels that we used and by how we defined neighbors. Our results showed that using contraction-predicted neighborhoods produces 30% better IRD results than using all connections to estimate the neighborhoods. Different grid designs were shown to have an impact on power noise. Power noise appears to be especially sensitive to the number of power pads.
For postlayout gate sizing, when using SLP, comparing si0 with si*, voltage drop improves by 53%, total excess noise becomes almost 0, decap area increases by 3.4 times, and CC improves by 53%. The gate-sizing heuristic algorithm produces results similar to si*; however, the runtime is one order of magnitude less. Even our biggest benchmark can be finished in 51 s.
These results show that our techniques are effective and efficient for power-noise reduction. For future, more advanced designs, power noise will become a more severe problem. Although a good decap allocation scheme is an important part of reducing power noise, other steps such as grid design, package design, and placement are also important. A more integrated approach is necessary to maintain power integrity. In advanced designs, leakage power becomes a serious problem, and decaps contribute to leakage. When leakage power is a limiting factor, our decap allocation method can be used to spread out clusters of frequently switching cells. Our method can be applied without decap insertion and can still improve power noise.
