Abstract
Introduction
Performance optimization has always been a critical step in the design of integrated circuits. Process technology scaling has made interconnect performance more dominant than transistor and logic performance. With the continued scaling of process technology, the interconnect resistance per unit length continues to increase, the capacitance per unit length remains roughly constant, and logic delay continues to decrease. These trends have caused interconnect delay to become more dominant than logic delay. Process technology options, such as copper wires, can only provide temporary relief. The trend of increasing interconnect dominance is expected to continue.
Interconnect-driven timing optimization techniques, such as wire sizing, buffer insertion, and gate sizing, have gained widespread acceptance in deep submicron design [7]. In particular, buffer insertion techniques have been successful in reducing interconnect delay. To the first order, interconnect delay is proportional to the square of the length of the wire. Inserting buffers effectively divides the wire into smaller segments, which makes the interconnect delay almost linear in terms of length (plus the buffer delays). Additional advantages of buffer insertion will make this optimization even more pervasive as the ratio of device to interconnect delay continues to decrease.
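The quadratic-to-linear effect described above can be illustrated with a minimal sketch. It assumes an idealized uniform wire whose first-order delay is r·c·L²/2, and it ignores buffer input capacitance and output resistance; the function names and unit values are hypothetical and not taken from the paper.

```python
def wire_delay(length, r=1.0, c=1.0):
    """First-order delay of an unloaded uniform wire: r * c * length^2 / 2."""
    return 0.5 * r * c * length ** 2


def buffered_delay(length, k, r=1.0, c=1.0, buffer_delay=0.0):
    """Delay when k ideal buffers split the wire into k + 1 equal segments.

    Assumption: each buffer only adds a fixed delay; its input capacitance
    and output resistance are ignored to keep the illustration simple.
    """
    segment = length / (k + 1)
    return (k + 1) * wire_delay(segment, r, c) + k * buffer_delay


# Doubling the length quadruples the unbuffered delay ...
unbuffered_ratio = wire_delay(16.0) / wire_delay(8.0)
# ... but only doubles it when the segment length is held fixed at 2 units.
buffered_ratio = buffered_delay(16.0, 7) / buffered_delay(8.0, 3)
```

With a nonzero `buffer_delay`, the buffered total still grows linearly in length as long as the segment length is held fixed.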
Several works study delay-driven buffer insertion. Closed-form solutions for 2-pin nets are proposed in [1] [4] [5] and [9]. In [16], Van Ginneken develops a dynamic programming algorithm which finds the optimal buffer placement under the Elmore delay model [10]. In [12], Lillis et al. extend this algorithm to simultaneously perform wire sizing, while also minimizing the total number of buffers. Finally, Alpert and Devgan [1] propose a wire segmenting pre-processing algorithm to handle the one buffer per wire limitation of Van Ginneken's algorithm, which results in a smooth trade-off between solution quality and run time.
Although timing optimization has always been critical in the design process, present-day design techniques and process technologies are making noise analysis and avoidance equally important. The shrinking of the minimum distance between adjacent wires has caused an increase in their coupling capacitance. Furthermore, as the ratio of wire thickness to width continues to increase, so will the ratio of coupling to total capacitance. Coupling capacitance can cause a switching net to induce noise onto a neighboring net, resulting in an incorrect functional response. Further, the widespread use of dynamic logic circuits has made noise avoidance even more critical since these logic families are more susceptible to noise failure. It is no longer sufficient or even acceptable to optimize only for delay. Noise avoidance techniques must become an integral part of the performance optimization environment. Buffer insertion provides a suitable platform for optimizing both timing and noise.
Figure 1(a) shows the noise effect that an aggressor net (top) can have on a victim net (bottom). The coupling capacitance may cause an input signal on the aggressor net to induce a noise pulse on the victim net. If the resulting noise is greater than the tolerable noise margin of the sink, then an electrical fault results. Figure 1(b) shows how inserting a buffer can distribute the capacitive coupling between the two newly created wires, resulting in smaller noise pulses on the input of the inserted buffer and on the sink. If the amplitudes of these noise pulses are less than the noise margins for the sink and the buffer, then the circuit will function correctly.
Noise analysis is typically performed through detailed circuit simulation or through reduced order interconnect analysis (e.g., AWE [13] and RICE [15] ). Although the latter is more efficient, it is still too slow to be used within an optimization tool. Instead, we adopt the noise metric of [8] .
The rest of the paper is organized as follows. Section 2 presents notation and definitions. In Section 3, we derive a formula for the maximum wire length such that no noise violation is induced and also present two optimal algorithms for noise avoidance. Section 4 presents a third algorithm for minimizing delay such that all noise constraints are satisfied. Finally, Section 5 presents experimental results.
Preliminaries
A routing tree contains a set of wires and a set of nodes where is the unique source node, is the set of sink nodes, and is the set of internal nodes. A wire with length is an ordered pair of nodes in which the signal propagates from to . Each node has a unique parent wire . The tree is assumed to be binary, i.e., each node can have at most two children. 1 Let the left and right children of be denoted by and respectively. Assume that if has only one child, then it is . The path from node to , denoted by , is the set of wires that connect to . We are also given a buffer library .
A buffer insertion solution is a mapping which either assigns a buffer or no buffer, denoted by , to each internal node of . 2 Let denote the number of internal nodes with inserted buffers. Wires are segmented as in [1] to create as many internal nodes as necessary to form a reasonable set of buffer locations. Assigning buffers to induces nets, and hence subtrees, each with no internally placed buffers. For each , let , the subtree rooted at , be the maximal subtree of such that is the source and contains no internal buffers. Observe that if , contains only one node. 
Delay Optimization
The Elmore delay for a wire is given by . The delay through a gate is given by . If , then . The total delay from to is given by .
Each sink has a given required arrival time , and assume that the input signal arrives at the source node at time zero. The condition must hold for the circuit to meet timing requirements. For every , let be the slack at where is the set of all sinks that are downstream from . Observe that the circuit meets its timing if and only if .
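The delay and slack definitions above can be sketched on a small unbuffered tree. The formulas themselves did not survive in the text, so the sketch assumes the standard Elmore form, in which a wire with resistance R and capacitance C driving a downstream load L has delay R·(C/2 + L), and takes the slack at a node as the minimum over downstream sinks of required arrival time minus delay to that sink. The `Node` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    wire_r: float = 0.0   # resistance of the wire from the parent (assumed unit)
    wire_c: float = 0.0   # capacitance of the wire from the parent
    sink_c: float = 0.0   # input capacitance (sinks only)
    rat: float = 0.0      # required arrival time (sinks only)
    children: list = field(default_factory=list)


def load(v):
    """Total capacitance seen at v looking downstream (no buffers)."""
    if not v.children:
        return v.sink_c
    return sum(ch.wire_c + load(ch) for ch in v.children)


def slack(v):
    """Minimum over downstream sinks of (RAT - Elmore delay from v to the sink)."""
    if not v.children:
        return v.rat
    return min(
        slack(ch) - ch.wire_r * (ch.wire_c / 2 + load(ch)) for ch in v.children
    )


s1 = Node("s1", wire_r=2.0, wire_c=4.0, sink_c=1.0, rat=100.0)
s2 = Node("s2", wire_r=1.0, wire_c=2.0, sink_c=3.0, rat=50.0)
u = Node("u", wire_r=1.0, wire_c=2.0, children=[s1, s2])
source = Node("so", children=[u])
```

The driver's own gate delay is not included here; timing is met when the slack at the source, after subtracting the driver delay, is non-negative.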
1 A non-binary tree can be converted into a binary tree by inserting wires with zero resistance and capacitance where appropriate.
2 A buffer placed on an internal node with degree is interpreted as having one input, one output, and fanouts.
Noise Avoidance
In [8], Devgan proposes a coupled noise estimation metric which is an upper bound for RC and overdamped RLC circuits. The metric depends on the resistance of the victim net, the resistance of the gate driving the victim net, coupling capacitances to the aggressor nets, and the rise times and the slopes of the signals on the aggressor nets. For example, consider the three aggressor nets and the single 2-pin victim net in Figure 2. The wire in the victim net is segmented into seven new wires such that each new wire is completely coupled to either 0, 1 or 2 of the aggressor nets. The coupling capacitance from an aggressor net can be modeled as some fraction of the wire capacitance of the victim net. Given the simultaneously switching aggressor nets near wire $e$, let $\lambda_j$ be the ratios of coupling to wire capacitance from the aggressor nets to $e$, and let $\mu_j$ be the slopes (i.e., power supply voltage over input rise time) of the aggressor net signals. The total current $I_e$ induced by the aggressor nets on $e$ is the wire capacitance $C_e$ multiplied by the sum, over the aggressor nets, of the products $\lambda_j \mu_j$.
Often, information about neighboring aggressor nets is incomplete, especially when buffer insertion is performed before routing. When performing buffer insertion in estimation mode, one might assume that (i) each wire is coupled to exactly one aggressor net, (ii) the slope of all aggressor nets is , and (iii) some fixed ratio of the total capacitance of each wire is due to coupling capacitance. Under these assumptions for each wire .
Let be the total downstream current seen at , i.e., .
Each wire adds to the noise induced on the victim net. The amount of additional noise induced from a wire is given by . The total additional noise seen at a sink starting at some upstream node is given by (5) where if there is no gate at . The path from to has no intermediate buffers. If an intermediate buffer is present, the noise computation begins from the output of the buffer, since the buffer is a restoring stage. Each node has a predetermined noise margin . The condition , , must hold for there to be no electrical faults. We define the noise slack for every as . Noise slack serves equivalently as a noise margin for internal nodes. Noise constraints for downstream sinks in are satisfied
Total current induced on a wire $e$ by its aggressor nets: $I_e = C_e \sum_j (\lambda_j \cdot \mu_j)$
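The noise metric of this section can be sketched for a 2-pin victim net as follows. The current formula follows the expression recovered above. The per-wire noise term R·(I_wire/2 + I_downstream) and the driver term R_driver·I_total are assumptions patterned on the π-model, since the original expressions are missing from the text. Function names and numeric values are hypothetical.

```python
def wire_current(cap, aggressors):
    """Current induced on a wire: cap * sum(ratio_j * slope_j).

    aggressors: iterable of (coupling_ratio, slope) pairs, one per aggressor.
    """
    return cap * sum(lam * mu for lam, mu in aggressors)


def sink_noise(driver_r, wires):
    """Noise at the sink of a 2-pin net with no intermediate buffers.

    wires: list of (resistance, induced_current), ordered driver -> sink.
    Assumed model: each wire adds R * (I_wire / 2 + I_downstream), and the
    driving gate adds driver_r times the total downstream current.
    """
    downstream = 0.0
    noise = 0.0
    for r, i in reversed(wires):      # walk from the sink back to the driver
        noise += r * (i / 2 + downstream)
        downstream += i
    return noise + driver_r * downstream


# Two aggressors coupled to one wire of capacitance 2.0 (arbitrary units).
i_e = wire_current(2.0, [(0.5, 4.0), (0.25, 8.0)])
# Two-segment victim net driven by a gate with resistance 1.0.
noise = sink_noise(1.0, [(2.0, 4.0), (1.0, 2.0)])
```

A sink is free of electrical faults under this sketch when the returned noise does not exceed its noise margin.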
Problem Formulations
We study two different buffer insertion problems. The first problem seeks to fix all noise violations with the fewest possible buffers. Delay is not considered.
Problem 1: Given a tree , a buffer library , and noise margins for each , find a solution which minimizes , such that for each .
This problem may be useful for non-critical nets, for which delay optimization is unnecessary. For timing-critical nets, we must consider both noise and delay at the same time.
Problem 2: Given a tree , a buffer library , and noise margins for each , find a solution which minimizes such that for each .
A third formulation can seek to minimize the total number of buffers inserted by while satisfying both noise and timing constraints. Algorithm 3, which is used to solve Problem 2, can also be applied to address this third formulation using the extension of Lillis et al. [12] to Van Ginneken's algorithm [16].
Noise Constrained Buffer Insertion
We begin with the simplest case of a wire with uniform width and neighboring coupling capacitance, as shown in Figure 3. For each wire , let and respectively be the wire resistance per unit length and the current per unit length. Since current is a constant times wire capacitance, we can use a π-model to represent its distribution.
Theorem 1
For a given wire in a routing tree , a buffer needs to be inserted on to satisfy noise constraints if and only if .
Proof: For noise constraints to be satisfied, we must have which is a quadratic in . Solving for yields the theorem.
Theorem 2
A net that has been optimally buffered to minimize delay alone may be susceptible to noise violations.
Proof: Consider a wire in which and are gates in the buffered net (a similar analysis holds for a path from to ). Let be the slope of an aggressor net, and let be the ratio of coupling to total capacitance for . If there is a noise violation at , then the following must hold. (7) Solving for yields .
For any fixed values for the parameters on the right side of the inequality, a noise violation will occur if the noise margin is small enough. Even if is reasonably large, an aggressor net can have a very large slope and a high coupling ratio which would also cause a violation.
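The quadratic behind Theorem 1 can be worked through numerically. The sketch below assumes the noise seen when a buffer with output resistance R_b drives a uniform wire of length l is R_b·(i·l + I) + r·l·(i·l/2 + I), where r and i are the resistance and current per unit length and I is the downstream current at the lower end of the wire. Setting this equal to the noise slack NS and solving for l gives the closed form used in `max_noise_free_length`. The symbol names are assumptions, since the original expression is missing from the text.

```python
import math


def noise_at_length(length, r, i, buf_r, down_i):
    """Assumed noise model for a buffer driving a wire of the given length."""
    return buf_r * (i * length + down_i) + r * length * (i * length / 2 + down_i)


def max_noise_free_length(r, i, buf_r, down_i, noise_slack):
    """Largest length l with noise_at_length(l) <= noise_slack.

    Positive root of (r*i/2) l^2 + (buf_r*i + r*down_i) l
    + (buf_r*down_i - noise_slack) = 0, rewritten as
    -a - b + sqrt(a^2 + b^2 + 2*NS/(i*r)) with a = buf_r/r, b = down_i/i.
    """
    if noise_slack < buf_r * down_i:
        raise ValueError("noise slack is violated even for a zero-length wire")
    a = buf_r / r
    b = down_i / i
    return -a - b + math.sqrt(a * a + b * b + 2.0 * noise_slack / (i * r))


length = max_noise_free_length(r=1.0, i=1.0, buf_r=1.0, down_i=1.0, noise_slack=7.0)
```

Under these assumptions, a wire longer than the returned length needs a buffer somewhere along it, which is the decision Theorem 1 describes.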
Noise Avoidance for Single-Sink Trees
Theorem 1 suggests how to insert buffers for single-sink trees. Begin at the sink and work up the tree, updating the total downstream current and noise slacks of visited internal nodes. At each internal node, use Theorem 1 to decide if a buffer should be inserted. The algorithm terminates when the source node is reached. Algorithm 1, Noise Avoidance for Single-Sink Trees, is presented in Figure 5 . The algorithm accepts a routing tree, and a single buffer type .
Step 1 initializes the current and noise slack of the sink node, then Steps 2-4 climb up the tree visiting each node in turn.
Step 3 examines whether or not a buffer needs to be inserted on the current wire , by computing the noise from placing a buffer at . If this noise is less than the noise slack, no buffer needs to be inserted, so the algorithm computes the downstream current and noise slack for node , then moves to the next wire. But if the noise is larger than the noise slack, then a buffer must be inserted.
Step 4 computes the maximum length that this buffer may be inserted from , and inserts it there at a new internal node . Finally, Step 5 computes the noise slack at the driver and inserts a buffer right after the driver if there is a noise violation (which can only occur if ). Optimality follows from the fact that buffers are always inserted at their maximal distance up the tree, according to Theorem 1.
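The bottom-up procedure described in Steps 1 to 5 can be sketched for a single path from sink to source. This is a simplified illustration, not the listing of Figure 5. It assumes the noise model R_b·(i·l + I) + r·l·(i·l/2 + I), positive per-unit values r and i, and a positive buffer noise margin. All names are hypothetical.

```python
import math


def max_len(r, i, buf_r, down_i, ns):
    """Maximal noise-free wire length under the assumed quadratic model."""
    a, b = buf_r / r, down_i / i
    return -a - b + math.sqrt(a * a + b * b + 2.0 * ns / (i * r))


def single_sink_buffers(wires, sink_margin, buf_r, buf_margin, driver_r):
    """Greedy buffer placement for one sink.

    wires: list of (length, r_per_unit, i_per_unit), ordered sink -> source.
    Returns (wire_index, distance_above_lower_node) for each buffer;
    (len(wires), 0.0) marks a buffer placed right after the driver.
    """
    cur, ns = 0.0, sink_margin          # downstream current and noise slack
    placed = []
    for idx, (length, r, i) in enumerate(wires):
        offset, remaining = 0.0, length
        while True:
            wire_noise = r * remaining * (i * remaining / 2 + cur)
            top_noise = buf_r * (i * remaining + cur) + wire_noise
            if top_noise <= ns:
                # A buffer at the upper node would still be safe: move up.
                ns -= wire_noise
                cur += i * remaining
                break
            # Otherwise insert a buffer as far up the wire as possible.
            step = max(0.0, max_len(r, i, buf_r, cur, ns))
            offset += step
            remaining -= step
            placed.append((idx, offset))
            cur, ns = 0.0, buf_margin   # the buffer restores the signal
    if driver_r * cur > ns:             # only possible if driver_r > buf_r
        placed.append((len(wires), 0.0))
    return placed


long_wire = single_sink_buffers([(10.0, 1.0, 1.0)], 7.0, 1.0, 7.0, 1.0)
weak_driver = single_sink_buffers([(1.0, 1.0, 1.0)], 7.0, 1.0, 7.0, 100.0)
```

In the first call every buffer sits at the maximal noise-free distance above the previous restoring stage, which is the property the optimality argument relies on.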
Noise Avoidance for Multi-Sink Trees
Some difficulty arises in extending Algorithm 1 to multiple sinks. Let , , and respectively be the wire, current and noise slack for the left (right) branch of an internal node with two children. It is possible that , and , i.e., the noise constraints for the left and right branches are satisfied but merging the left and right branches will induce a noise violation. Thus, a buffer must be placed on either the left or right branch immediately after . One cannot immediately deduce which branch to choose since one needs to know the characteristics and location of the gate driving . Since the algorithm is bottom-up, the location of this gate has not yet been determined.
We propose to generate a set of candidate solutions for each node and propagate these candidate solutions up the tree, in the same spirit as Van Ginneken's algorithm [16]. Whenever a node with two children is encountered and a buffer needs to be inserted, both a left and a right candidate are generated. The candidates are sorted in non-decreasing order by downstream current so that inferior solutions can be pruned [16]. Given two candidates and , is inferior to if and only if and .
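The pruning step can be sketched as a dominance filter over candidates sorted by downstream current. The inequality directions are an assumption, since the symbols are missing from the text: a candidate is treated as inferior when another candidate has no more downstream current and no less noise slack. The tuple layout and labels are hypothetical.

```python
def prune(cands):
    """Remove inferior candidates.

    cands: list of (downstream_current, noise_slack, label).
    Assumed rule: candidate A is inferior to B if A has at least as much
    downstream current and at most as much noise slack as B.
    """
    kept = []
    best_ns = float("-inf")
    # Sort by current ascending; break ties by larger noise slack first.
    for cur, ns, label in sorted(cands, key=lambda c: (c[0], -c[1])):
        if ns > best_ns:      # strictly better slack than every cheaper one
            kept.append((cur, ns, label))
            best_ns = ns
    return kept


survivors = prune([
    (3.0, 5.0, "a"),
    (1.0, 2.0, "b"),
    (2.0, 1.0, "c"),   # more current and less slack than "b"
    (4.0, 5.0, "d"),   # more current and the same slack as "a"
])
```

After sorting, one linear sweep suffices, because a candidate can only be dominated by one that appears earlier in the sorted order.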
Algorithm 2 is similar to Algorithm 1, except for Steps 4-7 which handle nodes with two children.
Step 4 iterates through each candidate for the left branch and each candidate for the right branch using Van Ginneken's linear merging technique.
Step 5 tests whether merging the two candidates results in a noise violation, and if not, Step 7 merges the two sets of candidates without inserting a buffer.
If there is a violation, then two new solutions, one with a new buffer on the left branch and one with a new buffer on the right branch, are generated and inserted into the current list of candidates. When the algorithm terminates, the solution(s) in with the fewest number of buffers is chosen.
The algorithm returns an optimal solution to Problem 1 in time quadratic in . As for Algorithm 1, one can obtain an optimal solution for a buffer library with multiple buffer types by selecting the buffer type with least resistance.
Optimizing Noise and Delay
To address Problem 2, we modify the approach of Van Ginneken [16] to include noise avoidance. Whenever Van Ginneken's algorithm considers inserting a buffer, we check the noise constraints; if they have been violated, the buffer is not inserted. Hence, our algorithm generates fewer solutions than Van Ginneken's algorithm since it prunes solutions which have noise violations.
As before, a list of candidates is computed for each node, except that a candidate is now a 5-tuple where is the load seen at , is the slack at , is the downstream current seen at , is the noise slack at , and is the current solution. Figure 7 illustrates Algorithm 3, which is the same as Van Ginneken's 3 except for the modifications in boldface.
Step 1 of Figure 7 segments the wires to generate sufficient possible buffer locations.
Step 2 calls Find_Cands, which returns a list of candidate solutions.
Step 3 adds the driver delay and computes the noise slack; the candidate with the best timing slack such that noise constraints are satisfied is then returned in Step 4.
3 The algorithm can be modified to handle inverting buffers [12].
[Algorithm listing residue. Input: current node to be processed. Output: list of candidate solutions for the node. Globals: routing tree, buffer library. Surviving step text: "Create internal node (with parent and child) on wire at distance from" and "11. Prune of inferior solutions and return".]
The Find_Cands procedure is shown in Figure 8. It takes the node as input, recursively computes the lists of possible candidates for all the nodes in , and then returns the candidates for . Find_Cands consists of four main parts:
• Steps 1-4 construct candidates for the children of and merge them to form , the set of candidates for .
• Step 5 considers each buffer type in the library and adds the buffer which yields the largest slack such that noise constraints are satisfied. A buffer will not be inserted if there is a noise violation. This step is the fundamental difference between Algorithm 3 and Van Ginneken's algorithm [16].
• Step 6 computes the new load, slack, current and noise slack for each candidate induced by the parent wire of .
• Finally, Step 7 prunes inferior candidates from , using the pruning schemes of [12] [16].
Modifications for noise avoidance do not increase the time complexity of Van Ginneken's algorithm since the modifications are mostly constant time operations for bookkeeping and pruning.
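The candidate bookkeeping in Steps 5 and 6 can be sketched as two small update functions on a (load, slack, current, noise slack, solution) tuple. The update formulas are assumptions based on the Elmore delay model and the π-model noise term, since the figure text did not survive. The `Cand` and `Buffer` types and all field names are hypothetical.

```python
from typing import NamedTuple, Optional, Tuple


class Cand(NamedTuple):
    load: float          # capacitance seen at the node
    slack: float         # timing slack at the node
    current: float       # downstream noise current
    noise_slack: float   # remaining noise budget
    buffers: Tuple = ()  # (node, buffer name) pairs chosen so far


class Buffer(NamedTuple):
    name: str
    in_cap: float
    out_r: float
    intrinsic: float
    noise_margin: float


def add_wire(c: Cand, wire_r: float, wire_c: float, wire_i: float) -> Cand:
    """Step 6 (assumed form): propagate a candidate across the parent wire."""
    return Cand(
        load=c.load + wire_c,
        slack=c.slack - wire_r * (wire_c / 2 + c.load),
        current=c.current + wire_i,
        noise_slack=c.noise_slack - wire_r * (wire_i / 2 + c.current),
        buffers=c.buffers,
    )


def try_buffer(c: Cand, buf: Buffer, node: str) -> Optional[Cand]:
    """Step 5 (assumed form): insert a buffer only if it causes no noise violation."""
    if buf.out_r * c.current > c.noise_slack:
        return None  # the buffer's output noise would exceed the noise slack
    return Cand(
        load=buf.in_cap,
        slack=c.slack - buf.intrinsic - buf.out_r * c.load,
        current=0.0,                      # the buffer is a restoring stage
        noise_slack=buf.noise_margin,
        buffers=c.buffers + ((node, buf.name),),
    )


sink = Cand(load=1.0, slack=100.0, current=0.0, noise_slack=2.0)
at_node = add_wire(sink, wire_r=2.0, wire_c=4.0, wire_i=0.5)
ok = try_buffer(at_node, Buffer("b1", 0.5, 1.0, 3.0, 2.0), "n1")
rejected = try_buffer(at_node, Buffer("weak", 0.5, 4.0, 3.0, 2.0), "n1")
```

The noise check in `try_buffer` is the only addition relative to a delay-only candidate update, and it is a constant-time operation per candidate.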
Theorem 3
If and , then Algorithm 3 returns an optimal solution to Problem 2.
Proof: Refer to [2] for a proof.
With multiple buffers in the buffer library, the optimality of Algorithm 3 is no longer guaranteed. However, in practice, Algorithm 3 generates solutions that are very close to optimal; our experimental results in the next section strongly support this claim.
Experimental Results
The algorithm we use in our experiments is an extended version of Algorithm 3, called BuffOpt, which can trade off between delay reduction, noise avoidance, and the total number of buffers [12]. We refer to the algorithm which optimizes only delay [1] [12] [16] as DOpt. DOpt is the same as Algorithm 3 but without the boldface text. For our experiments, we selected a set of 500 critical nets from a modern PowerPC microprocessor design. Table 1 shows the distribution of the sizes of these nets. To verify BuffOpt, we also ran a detailed, simulation-based noise analysis tool, called 3dnoise [14]. 3dnoise was run both before and after BuffOpt and DOpt. To perform noise analysis, 3dnoise uses accurate moment-matching based techniques that are similar to RICE [15]. We ran BuffOpt, DOpt and 3dnoise all in estimation mode (see Subsection 2.2), assuming a 0.7 coupling to total capacitance ratio from a single aggressor net with rise time 0.25 nanoseconds and a power supply voltage of 1.8V. The tolerable noise margin for every gate was set to 0.8V. The buffer library contained 5 inverting and 6 non-inverting buffers of varying power levels. Our experiments show that
• BuffOpt eliminated all noise problems in the design,
• DOpt could not fix all noise problems, and
• the average delay penalty from using BuffOpt instead of DOpt was less than 2%.
BuffOpt Successfully Avoids Noise
We ran BuffOpt on the 500 nets, and observed that BuffOpt identified 423 noise violations and successfully inserted buffers to fix all of them. To verify BuffOpt's identification of noise violations, we ran 3dnoise on the nets before running BuffOpt. The accurate analysis of 3dnoise identified 386 nets with noise violations, all of which were also identified by BuffOpt. BuffOpt identified 423 - 386 = 37 more nets with violations, which shows that the noise metric [8] is slightly conservative. We also ran 3dnoise on the nets after running BuffOpt, and 3dnoise identified no noise violations. This data is summarized in Table 2.
[Table 2, surviving row: After BuffOpt | 500 | 423 | 0]
Optimizing Delay Alone is Insufficient
We now compare BuffOpt to DOpt (optimal delay-driven buffer insertion) in terms of noise avoidance. Since BuffOpt never inserted more than four buffers on any net, we ran DOpt four times, where in each run no solution was allowed to have more than k buffers, with k ranging from 1 to 4. We denote one such run of DOpt by DOpt(k). Table 3 compares DOpt(k) with BuffOpt. TBI stands for "total buffers inserted" (i.e., times the number of nets with buffers) 4 , and #NVs stands for "number of noise violations".
Observe that running DOpt(4) causes the addition of 1126 more buffers than BuffOpt, yet there are still 13 noise violations. Reducing the maximum number of buffers that can be inserted by DOpt only increases the number of noise violations and still causes more buffers to be inserted for . Thus, as shown by Theorem 2, noise avoidance must be integrated into the algorithm to guarantee no noise violations. Finally, observe that BuffOpt uses less CPU time than DOpt(k) for . This occurs because BuffOpt prunes candidates with noise violations, which gives BuffOpt fewer total candidates to analyze.
The Delay Penalty is Small
Finally, we compare DOpt to BuffOpt in terms of total delay. We first ran BuffOpt on the 500 nets and recorded the buffers inserted for each net. We then ran DOpt for the same number of buffers as BuffOpt inserted in order to make an apples-to-apples comparison. We computed the reduction in total delay for each net and averaged the results by the number of buffers inserted. The cumulative results are presented in Table 4. For example, there were 232 nets for which two buffers were inserted, and on average BuffOpt reduced delay by 336.0 picoseconds while DOpt reduced delay by 338.2 picoseconds. The overall delay penalty is percent. The average delay reduction over all 423 nets is computed by taking the total delay reduction for all nets and dividing by 423. 5 Observe from the last column that the overall average delay penalty from avoiding noise is only 6 ps, or equivalently 1.99%. Thus, BuffOpt is able to integrate noise into a delay-driven algorithm with virtually no loss in total delay.
Recall that Algorithm 3 is optimal when the buffer library contains a single buffer type, but we could not guarantee optimality for a larger buffer library. The DOpt results in Table 4 form an upper bound on the optimal solution to Problem 2 since DOpt is optimal for delay alone. Observe that even with a buffer library of size 11, BuffOpt solutions are virtually optimal since they are on average within 2% of an upper bound. 
